Model Deployment: Detecting and Analyzing Machine Learning Model Drift Using Open-Source Monitoring Tools


John Pauline Pineda

October 31, 2025


  • 1. Table of Contents
    • 1.1 Data Background
    • 1.2 Data Description
    • 1.3 Data Quality Assessment
    • 1.4 Data Preprocessing
      • 1.4.1 Data Splitting
      • 1.4.2 Outlier and Distributional Shape Analysis
      • 1.4.3 Collinearity
    • 1.5 Data Exploration
      • 1.5.1 Exploratory Data Analysis
      • 1.5.2 Hypothesis Testing
    • 1.6 Premodelling Data Preparation
      • 1.6.1 Preprocessed Data Description
      • 1.6.2 Preprocessing Pipeline Development
    • 1.7 Model Development and Validation
      • 1.7.1 Random Forest
      • 1.7.2 AdaBoost
      • 1.7.3 Gradient Boosting
      • 1.7.4 XGBoost
      • 1.7.5 Light GBM
      • 1.7.6 CatBoost
    • 1.8 Model Selection
    • 1.9 Model Monitoring using the NannyML Framework
      • 1.9.1 Simulated Baseline Control
      • 1.9.2 Simulated Covariate Drift
      • 1.9.3 Simulated Prior Shift
      • 1.9.4 Simulated Concept Drift
      • 1.9.5 Simulated Missingness Spike
      • 1.9.6 Simulated Seasonal Pattern
    • 1.10 Consolidated Findings
  • 2. Summary
  • 3. References

1. Table of Contents

This project investigates open-source frameworks for post-deployment model monitoring and performance estimation, with a particular focus on NannyML for detecting and interpreting shifts in machine learning pipelines using Python. The objective was to systematically analyze how different types of drift and distribution changes manifest after model deployment, and to demonstrate how robust monitoring mitigates risks of performance degradation and biased decision-making. The workflow began with the development and selection of a baseline predictive model, which serves as a reference for stability. The dataset was then deliberately perturbed to simulate a range of realistic post-deployment scenarios: Covariate Drift (shifts in feature distributions), Prior Shift (changes in target label proportions), Concept Drift (evolving relationships between features and outcomes), Missingness Spikes (abrupt increases in absent data), and Seasonal Patterns (periodic variations in distributions). NannyML’s statistical tests, visualization capabilities, and performance estimation methods were subsequently applied to diagnose these shifts, evaluate their potential impact, and provide interpretable insights into model reliability. By contrasting baseline and perturbed conditions, the experiment demonstrated how continuous monitoring augments traditional offline evaluation, offering a safeguard against hidden risks. The findings highlighted how tools like NannyML can integrate seamlessly into MLOps workflows to enable proactive governance, early warning systems, and sustainable deployment practices. All results were consolidated in a Summary presented at the end of the document.

Post-Deployment Monitoring refers to the continuous oversight of machine learning models once they are integrated into production systems. Unlike offline evaluation, which relies on static validation datasets, monitoring addresses the challenges of evolving real-world data streams where underlying distributions may shift. Effective monitoring ensures that models remain accurate, unbiased, and aligned with business objectives. In MLOps, monitoring encompasses data integrity checks, drift detection, performance estimation, and alerting mechanisms. NannyML operationalizes this concept by focusing on performance estimation without ground truth, and by offering statistical methods to detect when data or predictions deviate from expected baselines. The challenges of post-deployment monitoring include delayed or missing ground truth labels, non-stationary data, hidden feedback loops, and difficulties distinguishing natural fluctuations from problematic drifts. Common solutions involve deploying drift detection algorithms, conducting regular audits of data pipelines, simulating counterfactuals, and retraining models on updated data. Monitoring frameworks must balance sensitivity (detecting real problems quickly) with robustness (avoiding false alarms caused by natural noise). Another key challenge is explainability: stakeholders need interpretable signals that justify interventions such as retraining or rolling back models. Tools like NannyML address these challenges through statistical tests for data drift, performance estimation without labels, missingness tracking, and visual diagnostics, making monitoring actionable for data scientists and business teams alike.

Covariate Drift occurs when the distribution of input features changes over time compared to the data used to train the model. Also known as data drift, it does not necessarily imply that the model’s predictive mapping is invalid, but it often precedes performance degradation. Detecting covariate drift requires comparing feature distributions between baseline (reference) data and incoming production data. NannyML provides multiple statistical tests and visualization tools to flag significant changes. Key signatures of covariate drift include shifts in summary statistics (mean, variance), changes in distributional shape, or increased divergence between reference and production feature distributions. These shifts may lead to poor generalization, as the model has not been exposed to the altered feature ranges. Detection techniques include univariate statistical tests (e.g., Kolmogorov–Smirnov, Chi-square), univariate distance measures (e.g., Jensen–Shannon divergence, Population Stability Index), multivariate methods (e.g., the reconstruction error of a dimensionality-reduction model fitted on reference data), and density estimation methods. Remediation approaches involve domain adaptation, re-weighting training samples, or retraining models on updated data distributions. NannyML implements univariate and multivariate tests, provides drift magnitude quantification, and visualizes feature-level changes, allowing practitioners to pinpoint which features are most responsible for the detected drift.
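
A minimal sketch of univariate covariate drift detection, independent of NannyML's own implementation, is given here. It compares a reference sample and a production sample of a single feature using SciPy's two-sample Kolmogorov–Smirnov test and Jensen–Shannon distance. The helper name `univariate_drift`, the synthetic samples, the bin count, and the 0.05 significance level are illustrative assumptions rather than settings used in this project.

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import jensenshannon

def univariate_drift(reference, production, bins=20, alpha=0.05):
    """Compare one numeric feature between a reference and a production sample."""
    # Two-sample KS test: a small p-value suggests the two distributions differ
    ks_stat, p_value = ks_2samp(reference, production)
    # Shared bin edges so that both histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([reference, production]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    prod_hist, _ = np.histogram(production, bins=edges)
    # jensenshannon normalizes its inputs; base=2 bounds the distance in [0, 1]
    js_distance = jensenshannon(ref_hist, prod_hist, base=2)
    return {
        "ks_stat": float(ks_stat),
        "p_value": float(p_value),
        "js_distance": float(js_distance),
        "drift": bool(p_value < alpha),
    }

# Synthetic samples (assumed values, loosely on the scale of a radius-like feature)
rng = np.random.default_rng(42)
reference = rng.normal(14.0, 3.5, size=500)
same = rng.normal(14.0, 3.5, size=500)       # drawn from the reference distribution
shifted = rng.normal(17.0, 3.5, size=500)    # mean shifted upward

result_same = univariate_drift(reference, same)
result_shifted = univariate_drift(reference, shifted)
print(result_same)
print(result_shifted)
```

A small p-value together with a large Jensen–Shannon distance points to a shifted feature, while identical samples give a distance of zero.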

Prior Shift arises when the distribution of the target variable changes, while the conditional relationship between features and labels remains stable. This is also referred to as label shift. Models trained on the original distribution may underperform because their predictions no longer match the new class priors. Detecting prior shifts is crucial, especially in imbalanced classification tasks where small changes in priors can lead to large performance impacts. Prior shift is typically characterized by systematic increases or decreases in class frequencies without corresponding changes in feature distributions. Its impact includes skewed decision thresholds, inflated false positives/negatives, and degraded calibration of predicted probabilities. Detection approaches include monitoring predicted class proportions, estimating priors using EM-based algorithms, and re-weighting predictions to align with new distributions. Correction strategies may involve resampling, threshold adjustment, or cost-sensitive learning. NannyML assists by tracking predicted probability distributions and comparing them against reference priors, using techniques such as KL divergence and PSI to quantify the magnitude of shift.
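
The re-weighting idea can be illustrated with a short sketch. Assuming that only the class priors change (the label shift assumption) and that the new prior is known or has been estimated, a binary classifier's predicted probability can be rescaled by the ratio of new to old priors and renormalized. The function name `adjust_for_prior_shift` and the prior values 0.37 and 0.60 are hypothetical and serve only as an illustration.

```python
import numpy as np

def adjust_for_prior_shift(proba, old_prior, new_prior):
    """Rescale positive-class probabilities from old_prior to new_prior.

    Assumes p(x | y) is unchanged, so only the class priors differ.
    """
    proba = np.asarray(proba, dtype=float)
    # Importance weights for the positive and negative classes
    w_pos = new_prior / old_prior
    w_neg = (1.0 - new_prior) / (1.0 - old_prior)
    numerator = proba * w_pos
    # Renormalize so that the two class probabilities sum to one
    return numerator / (numerator + (1.0 - proba) * w_neg)

# Hypothetical priors: positive class rises from 37% to 60% of cases
original = np.array([0.10, 0.45, 0.80])
adjusted = adjust_for_prior_shift(original, old_prior=0.37, new_prior=0.60)
print(adjusted)
```

When the positive class becomes more frequent, every adjusted probability moves upward, which is why a fixed decision threshold can become miscalibrated under prior shift.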

Concept Drift occurs when the underlying relationship between input features and target labels evolves over time. Unlike covariate drift, where only the feature distributions change, concept drift implies that the model’s mapping function itself becomes outdated. Concept drift is among the most damaging forms of drift because it directly undermines predictive accuracy. Detecting it often requires monitoring model outputs or inferred performance over time. NannyML addresses this by estimating performance even when ground truth labels are unavailable. Concept drift is typically signaled by a gradual or sudden decline in performance metrics, inconsistent error patterns, or misalignment between expected and actual prediction behavior. Its impact is severe: models may lose predictive power entirely if they cannot adapt. Detection methods include window-based performance monitoring, hypothesis testing, adaptive ensembles, and statistical monitoring of residuals. Corrective actions include periodic retraining, incremental learning, and online adaptation strategies. NannyML leverages Confidence-Based Performance Estimation (CBPE) and other statistical techniques to estimate performance without labels. Because CBPE assumes that the feature-target relationship learned at training time still holds, concept drift is most reliably confirmed by comparing these estimates against realized performance once labels arrive.
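
The defining property of concept drift, an unchanged feature distribution paired with a changed feature-target relationship, can be shown with a toy simulation. Everything in this sketch is assumed for illustration: a single standard-normal feature, a frozen threshold model, and a decision boundary that moves from 0.0 to 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference and production features come from the same distribution
x_ref = rng.normal(size=2000)
x_new = rng.normal(size=2000)

# The concept changes: the true decision boundary moves from 0.0 to 1.0
y_ref = (x_ref > 0.0).astype(int)
y_new = (x_new > 1.0).astype(int)

def frozen_model(x):
    """Model that learned the original concept and is never retrained."""
    return (x > 0.0).astype(int)

acc_ref = float((frozen_model(x_ref) == y_ref).mean())
acc_new = float((frozen_model(x_new) == y_new).mean())

print(f"Accuracy under original concept: {acc_ref:.3f}")
print(f"Accuracy under drifted concept:  {acc_new:.3f}")
```

Feature-level drift tests would see nothing unusual in `x_new`, yet accuracy drops because observations between the old and new boundary are now mislabeled by the frozen model.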

Missingness Spike refers to sudden increases in missing values within production data. Missing features can destabilize preprocessing pipelines, distort predictions, and signal upstream data collection failures. Monitoring missingness is critical for ensuring both model reliability and data pipeline health. NannyML provides built-in mechanisms to track and visualize changes in missing data patterns, alerting stakeholders before downstream impacts occur. Key indicators of missingness spikes include abrupt rises in null counts, missing categorical levels, or structural breaks in feature completeness. The consequences range from biased predictions to outright system failures if preprocessing pipelines cannot handle unexpected missingness. Detection methods include statistical monitoring of missing value proportions, anomaly detection on completeness metrics, and threshold-based alerts. Solutions typically involve robust imputation, pipeline hardening, and upstream data validation. NannyML offers automated missingness detection, completeness trend visualization, and configurable thresholds, ensuring that missingness issues are surfaced early.
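
Threshold-based monitoring of missing value proportions can be sketched with pandas alone. The example splits a synthetic production stream into fixed-size chunks, computes the missing rate per feature and chunk, and flags chunks above a threshold. The helper name `missing_rate_by_chunk`, the chunk size of 100, the 10% threshold, and the injected spike are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_rate_by_chunk(df, chunk_size):
    """Return the proportion of missing values per column for each chunk of rows."""
    chunk_id = np.arange(len(df)) // chunk_size
    return df.isna().groupby(chunk_id).mean()

# Synthetic production stream with two features (assumed values)
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "radius_mean": rng.normal(14.0, 3.5, 400),
    "texture_mean": rng.normal(19.0, 4.3, 400),
})

# Simulated spike: every second texture_mean value is missing in the last 100 rows
data.iloc[300::2, data.columns.get_loc("texture_mean")] = np.nan

rates = missing_rate_by_chunk(data, chunk_size=100)
alerts = rates > 0.10   # hypothetical alert threshold

print(rates)
print(alerts)
```

Only the final chunk of `texture_mean` exceeds the threshold, which localizes the problem to one feature and one time window.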

Seasonal Pattern Shift represents periodic fluctuations in data distributions or outcomes that follow predictable cycles. If models are not trained with sufficient historical data to capture these patterns, their predictions may systematically underperform during certain periods. NannyML’s monitoring can reveal recurring deviations, helping teams distinguish between natural seasonality and genuine drift that requires retraining. Seasonality is often characterized by cyclic patterns in data features, prediction distributions, or performance metrics. Its impact includes systematic biases, recurring error peaks, and difficulty distinguishing drift from natural variability. Detection techniques include autocorrelation analysis, Fourier decomposition, and seasonal-trend decomposition. Mitigation strategies involve training with longer historical datasets, adding time-related features, or developing seasonally adaptive models. NannyML highlights recurring deviations in drift metrics, making it easier for practitioners to separate cyclical behavior from true degradation, ensuring that alerts are contextually relevant.
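
Autocorrelation analysis offers a simple way to check whether a monitored metric repeats on a fixed cycle. This sketch computes the sample autocorrelation of a synthetic series at selected lags. The 12-period cycle, the noise level, and the function name `autocorrelation` are assumptions made for illustration.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a positive integer lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic monitored metric with a 12-period cycle plus mild noise
rng = np.random.default_rng(7)
t = np.arange(120)
metric = np.sin(2 * np.pi * t / 12) + rng.normal(0.0, 0.1, size=t.size)

acf_full_cycle = autocorrelation(metric, lag=12)   # in phase with the cycle
acf_half_cycle = autocorrelation(metric, lag=6)    # out of phase with the cycle

print(f"Lag 12 autocorrelation: {acf_full_cycle:.2f}")
print(f"Lag 6 autocorrelation:  {acf_half_cycle:.2f}")
```

A strongly positive value at the cycle length, paired with a strongly negative value at half the cycle, suggests recurring seasonality rather than a one-off drift event.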

Performance Estimation Without Labels refers to scenarios in real-world deployments where the ground truth often arrives with delays or may never be available. This makes direct performance tracking difficult. NannyML addresses this challenge by providing algorithms to estimate model performance without labels using confidence distributions, statistical inference, and robust estimation techniques. This capability allows practitioners to maintain visibility into model health continuously, even in label-scarce settings, bridging a critical gap in MLOps monitoring practices. Algorithms in this domain include Confidence-Based Performance Estimation (CBPE), which uses the model’s calibrated predicted probabilities to compute the expected confusion matrix for classification tasks, and Direct Loss Estimation (DLE), which trains a secondary model to predict the loss of the monitored regression model. Statistical inference techniques allow practitioners to construct confidence bounds around estimated metrics, while robust estimation mitigates the risk of spurious signals caused by small sample sizes or noisy predictions. NannyML provides implementations of CBPE and DLE: CBPE supports classification metrics such as precision, recall, F1-score, and AUROC, while DLE supports regression metrics such as MAE and RMSE, all estimated without labels. This makes it possible to detect when a model is underperforming even before labels are collected, reducing blind spots in production monitoring.
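
The core idea behind confidence-based estimation can be sketched in a few lines. If the predicted probabilities are well calibrated, a prediction with probability p for the positive class is correct with probability p when the predicted label is positive and 1 - p when it is negative, so the expected confusion matrix follows by summation. This is a simplified illustration and not NannyML's implementation; the function name `estimate_metrics_without_labels` and the example probabilities are assumptions.

```python
import numpy as np

def estimate_metrics_without_labels(proba, threshold=0.5):
    """Estimate classification metrics from calibrated probabilities alone."""
    proba = np.asarray(proba, dtype=float)
    predicted_positive = proba >= threshold

    # Expected confusion matrix entries under perfect calibration
    tp = proba[predicted_positive].sum()
    fp = (1.0 - proba[predicted_positive]).sum()
    fn = proba[~predicted_positive].sum()
    tn = (1.0 - proba[~predicted_positive]).sum()

    return {
        "accuracy": float((tp + tn) / len(proba)),
        "precision": float(tp / (tp + fp)) if (tp + fp) > 0 else float("nan"),
        "recall": float(tp / (tp + fn)) if (tp + fn) > 0 else float("nan"),
    }

# Hypothetical calibrated probabilities for five production observations
estimates = estimate_metrics_without_labels([0.9, 0.8, 0.6, 0.3, 0.1])
print(estimates)
```

The estimate is only as good as the calibration: if the relationship between features and target changes, the probabilities stop reflecting true correctness and the estimate can diverge from realized performance.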

Performance Estimation With Labels refers to the direct evaluation of model predictions against actual ground truth outcomes once labels are available. Unlike label-free methods, this approach allows for precise calculation of traditional performance metrics such as accuracy, precision, recall, F1-score, AUROC, and calibration error. Monitoring with labels provides the most reliable indication of model performance, enabling fine-grained diagnosis of errors and biases. The advantage of having labels is the ability to attribute errors to specific subgroups, detect fairness violations, and conduct targeted retraining. Challenges include label delay, annotation quality, and ensuring that labels accurately reflect the operational environment. Common approaches include sliding window evaluation, where performance is tracked over recent data batches, and benchmark comparison, where production metrics are compared to baseline test set results. NannyML incorporates labeled performance tracking alongside its label-free estimators, allowing users to validate estimates once ground truth becomes available. This dual capability ensures consistency, improves confidence in label-free methods, and provides a comprehensive framework for performance monitoring in both short-term and long-term horizons.
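
Sliding window evaluation against a baseline can be sketched as follows, assuming labels have already been joined to predictions in arrival order. The helper name `realized_accuracy_by_chunk`, the toy labels, the baseline accuracy of 0.90, and the tolerance of 0.10 are hypothetical.

```python
import numpy as np
import pandas as pd

def realized_accuracy_by_chunk(y_true, y_pred, chunk_size):
    """Compute realized accuracy for consecutive chunks of labeled predictions."""
    frame = pd.DataFrame({"correct": np.asarray(y_true) == np.asarray(y_pred)})
    frame["chunk"] = np.arange(len(frame)) // chunk_size
    return frame.groupby("chunk")["correct"].mean()

# Toy labeled production data: the second chunk contains two errors
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

chunk_accuracy = realized_accuracy_by_chunk(y_true, y_pred, chunk_size=4)

# Benchmark comparison: alert when a chunk falls below baseline minus tolerance
baseline_accuracy = 0.90   # hypothetical test-set accuracy
tolerance = 0.10           # hypothetical acceptable drop
alerts = chunk_accuracy < (baseline_accuracy - tolerance)

print(chunk_accuracy)
print(alerts)
```

The same chunking can be reused for label-free estimates, which makes it straightforward to overlay estimated and realized performance once labels arrive.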

1.1. Data Background

An open Breast Cancer Dataset from Kaggle (with all credits attributed to Wasiq Ali) was used for the analysis as consolidated from the following primary sources:

  1. Reference Repository entitled Differentiated breast Cancer Recurrence from UC Irvine Machine Learning Repository
  2. Research Paper entitled Nuclear Feature Extraction for Breast Tumor Diagnosis from the Electronic Imaging

This study hypothesized that the cell nuclei features derived from digitized images of fine needle aspirates (FNA) of breast masses influence breast cancer diagnoses between patients.

The dichotomous categorical target variable for the study is:

  • diagnosis - Status of the patient (M, Medical diagnosis of a cancerous breast tumor | B, Medical diagnosis of a non-cancerous breast tumor)

The predictor variables for the study are:

  • radius_mean - Mean of the radius measurements (Mean of distances from center to points on the perimeter)
  • texture_mean - Mean of the texture measurements (Standard deviation of grayscale values)
  • perimeter_mean - Mean of the perimeter measurements
  • area_mean - Mean of the area measurements
  • smoothness_mean - Mean of the smoothness measurements (Local variation in radius lengths)
  • compactness_mean - Mean of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_mean - Mean of the concavity measurements (Severity of concave portions of the contour)
  • concave points_mean - Mean of the concave points measurements (Number of concave portions of the contour)
  • symmetry_mean - Mean of the symmetry measurements
  • fractal_dimension_mean - Mean of the fractal dimension measurements (Coastline approximation - 1)
  • radius_se - Standard error of the radius measurements (Standard error of distances from center to points on the perimeter)
  • texture_se - Standard error of the texture measurements (Standard deviation of grayscale values)
  • perimeter_se - Standard error of the perimeter measurements
  • area_se - Standard error of the area measurements
  • smoothness_se - Standard error of the smoothness measurements (Local variation in radius lengths)
  • compactness_se - Standard error of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_se - Standard error of the concavity measurements (Severity of concave portions of the contour)
  • concave points_se - Standard error of the concave points measurements (Number of concave portions of the contour)
  • symmetry_se - Standard error of the symmetry measurements
  • fractal_dimension_se - Standard error of the fractal dimension measurements (Coastline approximation - 1)
  • radius_worst - Largest value of the radius measurements (Largest value of distances from center to points on the perimeter)
  • texture_worst - Largest value of the texture measurements (Standard deviation of grayscale values)
  • perimeter_worst - Largest value of the perimeter measurements
  • area_worst - Largest value of the area measurements
  • smoothness_worst - Largest value of the smoothness measurements (Local variation in radius lengths)
  • compactness_worst - Largest value of the compactness measurements (Perimeter² / area - 1.0)
  • concavity_worst - Largest value of the concavity measurements (Severity of concave portions of the contour)
  • concave points_worst - Largest value of the concave points measurements (Number of concave portions of the contour)
  • symmetry_worst - Largest value of the symmetry measurements
  • fractal_dimension_worst - Largest value of the fractal dimension measurements (Coastline approximation - 1)
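
To make the three summary families concrete, this sketch derives a mean, a standard error, and a worst (largest) value from per-nucleus measurements of a single image, and evaluates the compactness formula listed above. The measurement values and function names are hypothetical, the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei, and the worst value follows this document's definition as the largest measurement.

```python
import numpy as np

def summarize_feature(values):
    """Summarize per-nucleus measurements into mean, standard error, and worst."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": float(values.mean()),
        # Assumed definition: sample standard deviation / sqrt(n)
        "se": float(values.std(ddof=1) / np.sqrt(len(values))),
        # Largest value, as described in the variable list
        "worst": float(values.max()),
    }

def compactness(perimeter, area):
    """Compactness as listed in the variable descriptions: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Hypothetical radius measurements for four nuclei in one image
radius_summary = summarize_feature([10.0, 12.0, 14.0, 16.0])
print(radius_summary)

# A perfect circle of radius 1 gives 4*pi - 1, the smallest possible compactness
circle_compactness = compactness(perimeter=2 * np.pi, area=np.pi)
print(round(circle_compactness, 3))
```

Irregular contours have a longer perimeter for the same area, so their compactness exceeds the circular minimum.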

1.2. Data Description

  1. The initial tabular dataset comprised 569 observations and 32 variables (including 1 metadata, 1 target and 30 predictors).
    • 569 rows (observations)
    • 32 columns (variables)
      • 1/32 metadata (categorical)
        • id
      • 1/32 target (categorical)
        • diagnosis
      • 30/32 predictor (numeric)
        • radius_mean
        • texture_mean
        • perimeter_mean
        • area_mean
        • smoothness_mean
        • compactness_mean
        • concavity_mean
        • concave points_mean
        • symmetry_mean
        • fractal_dimension_mean
        • radius_se
        • texture_se
        • perimeter_se
        • area_se
        • smoothness_se
        • compactness_se
        • concavity_se
        • concave points_se
        • symmetry_se
        • fractal_dimension_se
        • radius_worst
        • texture_worst
        • perimeter_worst
        • area_worst
        • smoothness_worst
        • compactness_worst
        • concavity_worst
        • concave points_worst
        • symmetry_worst
        • fractal_dimension_worst
  2. The id variable was transformed to a row index for the data observations.
In [1]:
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
import joblib
import re
import pickle
%matplotlib inline

import nannyml as nml
from nannyml.performance_estimation import CBPE
from nannyml.chunk import DefaultChunker

import hashlib
import json
from urllib.parse import urlparse
import logging

from operator import truediv
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.decomposition import PCA
from scipy import stats
from scipy.stats import pointbiserialr, chi2_contingency

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, ParameterGrid, StratifiedShuffleSplit, RepeatedStratifiedKFold, GridSearchCV
from sklearn.utils import resample
from sklearn.base import clone

import warnings
warnings.filterwarnings("ignore", message=".*force_all_finite.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names")
In [2]:
##################################
# Defining file paths
##################################
# Building the paths with os.path.join so that they resolve on any operating system
DATASETS_ORIGINAL_PATH = os.path.join("datasets", "original")
DATASETS_FINAL_PATH = os.path.join("datasets", "final", "complete")
DATASETS_FINAL_TRAIN_PATH = os.path.join("datasets", "final", "train")
DATASETS_FINAL_TRAIN_FEATURES_PATH = os.path.join(DATASETS_FINAL_TRAIN_PATH, "features")
DATASETS_FINAL_TRAIN_TARGET_PATH = os.path.join(DATASETS_FINAL_TRAIN_PATH, "target")
DATASETS_FINAL_VALIDATION_PATH = os.path.join("datasets", "final", "validation")
DATASETS_FINAL_VALIDATION_FEATURES_PATH = os.path.join(DATASETS_FINAL_VALIDATION_PATH, "features")
DATASETS_FINAL_VALIDATION_TARGET_PATH = os.path.join(DATASETS_FINAL_VALIDATION_PATH, "target")
DATASETS_FINAL_TEST_PATH = os.path.join("datasets", "final", "test")
DATASETS_FINAL_TEST_FEATURES_PATH = os.path.join(DATASETS_FINAL_TEST_PATH, "features")
DATASETS_FINAL_TEST_TARGET_PATH = os.path.join(DATASETS_FINAL_TEST_PATH, "target")
DATASETS_PREPROCESSED_PATH = os.path.join("datasets", "preprocessed")
DATASETS_PREPROCESSED_TRAIN_PATH = os.path.join(DATASETS_PREPROCESSED_PATH, "train")
DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH = os.path.join(DATASETS_PREPROCESSED_TRAIN_PATH, "features")
DATASETS_PREPROCESSED_TRAIN_TARGET_PATH = os.path.join(DATASETS_PREPROCESSED_TRAIN_PATH, "target")
DATASETS_PREPROCESSED_VALIDATION_PATH = os.path.join(DATASETS_PREPROCESSED_PATH, "validation")
DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH = os.path.join(DATASETS_PREPROCESSED_VALIDATION_PATH, "features")
DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH = os.path.join(DATASETS_PREPROCESSED_VALIDATION_PATH, "target")
DATASETS_PREPROCESSED_TEST_PATH = os.path.join(DATASETS_PREPROCESSED_PATH, "test")
DATASETS_PREPROCESSED_TEST_FEATURES_PATH = os.path.join(DATASETS_PREPROCESSED_TEST_PATH, "features")
DATASETS_PREPROCESSED_TEST_TARGET_PATH = os.path.join(DATASETS_PREPROCESSED_TEST_PATH, "target")
MODELS_PATH = "models"
In [3]:
##################################
# Loading the dataset
# from the DATASETS_ORIGINAL_PATH
##################################
breast_cancer = pd.read_csv(os.path.join("..", DATASETS_ORIGINAL_PATH, "Breast_Cancer_Dataset.csv"))
In [4]:
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(breast_cancer.shape)
Dataset Dimensions: 
(569, 32)
In [5]:
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(breast_cancer.dtypes)
Column Names and Data Types:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
In [6]:
##################################
# Setting the ID column as row names
##################################
breast_cancer = breast_cancer.set_index("id")
In [7]:
##################################
# Taking a snapshot of the dataset
##################################
breast_cancer.head()
Out[7]:
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
id
842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 31 columns

In [8]:
##################################
# Performing a general exploration of the numeric variables
##################################
print('Numeric Variable Summary:')
display(breast_cancer.describe(include='number').transpose())
Numeric Variable Summary:
                         count        mean         std         min         25%         50%          75%         max
radius_mean              569.0   14.127292    3.524049    6.981000   11.700000   13.370000    15.780000    28.11000
texture_mean             569.0   19.289649    4.301036    9.710000   16.170000   18.840000    21.800000    39.28000
perimeter_mean           569.0   91.969033   24.298981   43.790000   75.170000   86.240000   104.100000   188.50000
area_mean                569.0  654.889104  351.914129  143.500000  420.300000  551.100000   782.700000  2501.00000
smoothness_mean          569.0    0.096360    0.014064    0.052630    0.086370    0.095870     0.105300     0.16340
compactness_mean         569.0    0.104341    0.052813    0.019380    0.064920    0.092630     0.130400     0.34540
concavity_mean           569.0    0.088799    0.079720    0.000000    0.029560    0.061540     0.130700     0.42680
concave points_mean      569.0    0.048919    0.038803    0.000000    0.020310    0.033500     0.074000     0.20120
symmetry_mean            569.0    0.181162    0.027414    0.106000    0.161900    0.179200     0.195700     0.30400
fractal_dimension_mean   569.0    0.062798    0.007060    0.049960    0.057700    0.061540     0.066120     0.09744
radius_se                569.0    0.405172    0.277313    0.111500    0.232400    0.324200     0.478900     2.87300
texture_se               569.0    1.216853    0.551648    0.360200    0.833900    1.108000     1.474000     4.88500
perimeter_se             569.0    2.866059    2.021855    0.757000    1.606000    2.287000     3.357000    21.98000
area_se                  569.0   40.337079   45.491006    6.802000   17.850000   24.530000    45.190000   542.20000
smoothness_se            569.0    0.007041    0.003003    0.001713    0.005169    0.006380     0.008146     0.03113
compactness_se           569.0    0.025478    0.017908    0.002252    0.013080    0.020450     0.032450     0.13540
concavity_se             569.0    0.031894    0.030186    0.000000    0.015090    0.025890     0.042050     0.39600
concave points_se        569.0    0.011796    0.006170    0.000000    0.007638    0.010930     0.014710     0.05279
symmetry_se              569.0    0.020542    0.008266    0.007882    0.015160    0.018730     0.023480     0.07895
fractal_dimension_se     569.0    0.003795    0.002646    0.000895    0.002248    0.003187     0.004558     0.02984
radius_worst             569.0   16.269190    4.833242    7.930000   13.010000   14.970000    18.790000    36.04000
texture_worst            569.0   25.677223    6.146258   12.020000   21.080000   25.410000    29.720000    49.54000
perimeter_worst          569.0  107.261213   33.602542   50.410000   84.110000   97.660000   125.400000   251.20000
area_worst               569.0  880.583128  569.356993  185.200000  515.300000  686.500000  1084.000000  4254.00000
smoothness_worst         569.0    0.132369    0.022832    0.071170    0.116600    0.131300     0.146000     0.22260
compactness_worst        569.0    0.254265    0.157336    0.027290    0.147200    0.211900     0.339100     1.05800
concavity_worst          569.0    0.272188    0.208624    0.000000    0.114500    0.226700     0.382900     1.25200
concave points_worst     569.0    0.114606    0.065732    0.000000    0.064930    0.099930     0.161400     0.29100
symmetry_worst           569.0    0.290076    0.061867    0.156500    0.250400    0.282200     0.317900     0.66380
fractal_dimension_worst  569.0    0.083946    0.018061    0.055040    0.071460    0.080040     0.092080     0.20750

1.3. Data Quality Assessment

Data quality findings based on assessment are as follows:

  1. No duplicated rows were noted.
  2. No missing data were noted: no variable had Null.Count>0 or Fill.Rate<1.0.
  3. No low variance was observed: no variable had First.Second.Mode.Ratio>5.
  4. No low variance was observed: no variable had Unique.Count.Ratio>10.
  5. High skewness observed for 5 variables with Skewness>3 or Skewness<(-3).
    • area_se: Skewness = 5.447
    • concavity_se: Skewness = 5.110
    • fractal_dimension_se: Skewness = 3.923
    • perimeter_se: Skewness = 3.443
    • radius_se: Skewness = 3.088
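
The screening metrics named in the findings can be sketched on a toy column. The definitions used here are assumptions, since the corresponding computations are not shown in this part of the document: First.Second.Mode.Ratio is taken as the frequency of the most common value divided by that of the second most common, Unique.Count.Ratio as the row count divided by the number of unique values, and Skewness as the pandas sample skewness.

```python
import pandas as pd

def quality_metrics(column):
    """Compute simple variance and shape screening metrics for one numeric column."""
    counts = column.value_counts()   # sorted from most to least frequent
    first = counts.iloc[0]
    second = counts.iloc[1] if len(counts) > 1 else float("nan")
    return {
        # Assumed definition: most frequent count / second most frequent count
        "First.Second.Mode.Ratio": float(first / second),
        # Assumed definition: number of rows / number of unique values
        "Unique.Count.Ratio": float(len(column) / column.nunique()),
        "Skewness": float(column.skew()),
    }

# Toy column with a dominant value and one extreme observation
toy = pd.Series([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 50.0])
metrics = quality_metrics(toy)
print(metrics)
```

Large ratios indicate that a few values dominate the column, while a skewness far from zero indicates a long tail such as the one created by the extreme observation here.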
In [9]:
##################################
# Counting the number of duplicated rows
##################################
breast_cancer.duplicated().sum()
Out[9]:
np.int64(0)
In [10]:
##################################
# Gathering the data types for each column
##################################
data_type_list = list(breast_cancer.dtypes)
In [11]:
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(breast_cancer.columns)
In [12]:
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(breast_cancer)] * len(breast_cancer.columns))
In [13]:
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(breast_cancer.isna().sum(axis=0))
In [14]:
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(breast_cancer.count())
In [15]:
##################################
# Gathering the fill rate (proportion of non-missing data) for each column
##################################
fill_rate_list = list(map(truediv, non_null_count_list, row_count_list))
In [16]:
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
                                              data_type_list,
                                              row_count_list,
                                              non_null_count_list,
                                              null_count_list,
                                              fill_rate_list), 
                                        columns=['Column.Name',
                                                 'Column.Type',
                                                 'Row.Count',
                                                 'Non.Null.Count',
                                                 'Null.Count',                                                 
                                                 'Fill.Rate'])
display(all_column_quality_summary)
Column.Name Column.Type Row.Count Non.Null.Count Null.Count Fill.Rate
0 diagnosis object 569 569 0 1.0
1 radius_mean float64 569 569 0 1.0
2 texture_mean float64 569 569 0 1.0
3 perimeter_mean float64 569 569 0 1.0
4 area_mean float64 569 569 0 1.0
5 smoothness_mean float64 569 569 0 1.0
6 compactness_mean float64 569 569 0 1.0
7 concavity_mean float64 569 569 0 1.0
8 concave points_mean float64 569 569 0 1.0
9 symmetry_mean float64 569 569 0 1.0
10 fractal_dimension_mean float64 569 569 0 1.0
11 radius_se float64 569 569 0 1.0
12 texture_se float64 569 569 0 1.0
13 perimeter_se float64 569 569 0 1.0
14 area_se float64 569 569 0 1.0
15 smoothness_se float64 569 569 0 1.0
16 compactness_se float64 569 569 0 1.0
17 concavity_se float64 569 569 0 1.0
18 concave points_se float64 569 569 0 1.0
19 symmetry_se float64 569 569 0 1.0
20 fractal_dimension_se float64 569 569 0 1.0
21 radius_worst float64 569 569 0 1.0
22 texture_worst float64 569 569 0 1.0
23 perimeter_worst float64 569 569 0 1.0
24 area_worst float64 569 569 0 1.0
25 smoothness_worst float64 569 569 0 1.0
26 compactness_worst float64 569 569 0 1.0
27 concavity_worst float64 569 569 0 1.0
28 concave points_worst float64 569 569 0 1.0
29 symmetry_worst float64 569 569 0 1.0
30 fractal_dimension_worst float64 569 569 0 1.0
In [17]:
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
Out[17]:
0
In [18]:
##################################
# Identifying the columns
# with Fill.Rate < 0.90
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<0.90)]
In [19]:
##################################
# Gathering the indices for each observation
##################################
row_index_list = breast_cancer.index
In [20]:
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(breast_cancer.columns)] * len(breast_cancer))
In [21]:
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(breast_cancer.isna().sum(axis=1))
In [22]:
##################################
# Gathering the missing data rate for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
In [23]:
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_index_list,
                                           column_count_list,
                                           null_row_list,
                                           missing_rate_list), 
                                        columns=['Row.Name',
                                                 'Column.Count',
                                                 'Null.Count',                                                 
                                                 'Missing.Rate'])
display(all_row_quality_summary)
Row.Name Column.Count Null.Count Missing.Rate
0 842302 31 0 0.0
1 842517 31 0 0.0
2 84300903 31 0 0.0
3 84348301 31 0 0.0
4 84358402 31 0 0.0
... ... ... ... ...
564 926424 31 0 0.0
565 926682 31 0 0.0
566 926954 31 0 0.0
567 927241 31 0 0.0
568 92751 31 0 0.0

569 rows × 4 columns

In [24]:
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
Out[24]:
0
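The column-wise and row-wise completeness checks above can also be written compactly with boolean masks. The following is a minimal sketch on a toy frame (the column names `a`, `b`, `c` and the values are hypothetical, not taken from the breast cancer data); it shows that the mean of a null mask gives the same fill rate and missing rate as the list-based computation.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; not the breast cancer data
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

# Column-wise fill rate: share of non-null values in each column
fill_rate = toy.notna().mean(axis=0)

# Row-wise missing rate: share of null values in each row
missing_rate = toy.isna().mean(axis=1)

print(fill_rate)
print(missing_rate)
```

Because both results are pandas Series indexed by column name or row label, the threshold filters used in the cells above (for example `Fill.Rate < 0.90` or `Missing.Rate > 0.00`) reduce to a single boolean comparison.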
In [25]:
##################################
# Formulating the dataset
# with numeric columns only
##################################
breast_cancer_numeric = breast_cancer.select_dtypes(include='number')
In [26]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = breast_cancer_numeric.columns
In [27]:
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = breast_cancer_numeric.min()
In [28]:
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = breast_cancer_numeric.mean()
In [29]:
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = breast_cancer_numeric.median()
In [30]:
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = breast_cancer_numeric.max()
In [31]:
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [breast_cancer[x].value_counts(dropna=True).index.tolist()[0] for x in breast_cancer_numeric]
In [32]:
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [breast_cancer[x].value_counts(dropna=True).index.tolist()[1] for x in breast_cancer_numeric]
In [33]:
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [breast_cancer_numeric[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in breast_cancer_numeric]
In [34]:
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [breast_cancer_numeric[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in breast_cancer_numeric]
In [35]:
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
In [36]:
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = breast_cancer_numeric.nunique(dropna=True)
In [37]:
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(breast_cancer_numeric)] * len(breast_cancer_numeric.columns))
In [38]:
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
In [39]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = breast_cancer_numeric.skew()
In [40]:
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = breast_cancer_numeric.kurtosis()
In [41]:
##################################
# Generating a column quality summary for the numeric column
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                numeric_minimum_list,
                                                numeric_mean_list,
                                                numeric_median_list,
                                                numeric_maximum_list,
                                                numeric_first_mode_list,
                                                numeric_second_mode_list,
                                                numeric_first_mode_count_list,
                                                numeric_second_mode_count_list,
                                                numeric_first_second_mode_ratio_list,
                                                numeric_unique_count_list,
                                                numeric_row_count_list,
                                                numeric_unique_count_ratio_list,
                                                numeric_skewness_list,
                                                numeric_kurtosis_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Minimum',
                                                 'Mean',
                                                 'Median',
                                                 'Maximum',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio',
                                                 'Skewness',
                                                 'Kurtosis'])
display(numeric_column_quality_summary)
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
0 radius_mean 6.981000 14.127292 13.370000 28.11000 12.340000 11.060000 4 3 1.333333 456 569 0.801406 0.942380 0.845522
1 texture_mean 9.710000 19.289649 18.840000 39.28000 16.840000 19.830000 3 3 1.000000 479 569 0.841828 0.650450 0.758319
2 perimeter_mean 43.790000 91.969033 86.240000 188.50000 82.610000 134.700000 3 3 1.000000 522 569 0.917399 0.990650 0.972214
3 area_mean 143.500000 654.889104 551.100000 2501.00000 512.200000 394.100000 3 2 1.500000 539 569 0.947276 1.645732 3.652303
4 smoothness_mean 0.052630 0.096360 0.095870 0.16340 0.100700 0.105400 5 4 1.250000 474 569 0.833040 0.456324 0.855975
5 compactness_mean 0.019380 0.104341 0.092630 0.34540 0.114700 0.120600 3 3 1.000000 537 569 0.943761 1.190123 1.650130
6 concavity_mean 0.000000 0.088799 0.061540 0.42680 0.000000 0.120400 13 3 4.333333 537 569 0.943761 1.401180 1.998638
7 concave points_mean 0.000000 0.048919 0.033500 0.20120 0.000000 0.028640 13 3 4.333333 542 569 0.952548 1.171180 1.066556
8 symmetry_mean 0.106000 0.181162 0.179200 0.30400 0.176900 0.189300 4 4 1.000000 432 569 0.759227 0.725609 1.287933
9 fractal_dimension_mean 0.049960 0.062798 0.061540 0.09744 0.067820 0.061130 3 3 1.000000 499 569 0.876977 1.304489 3.005892
10 radius_se 0.111500 0.405172 0.324200 2.87300 0.286000 0.220400 3 3 1.000000 540 569 0.949033 3.088612 17.686726
11 texture_se 0.360200 1.216853 1.108000 4.88500 0.856100 1.350000 3 3 1.000000 519 569 0.912127 1.646444 5.349169
12 perimeter_se 0.757000 2.866059 2.287000 21.98000 1.778000 1.143000 4 2 2.000000 533 569 0.936731 3.443615 21.401905
13 area_se 6.802000 40.337079 24.530000 542.20000 16.970000 16.640000 3 3 1.000000 528 569 0.927944 5.447186 49.209077
14 smoothness_se 0.001713 0.007041 0.006380 0.03113 0.005910 0.006064 2 2 1.000000 547 569 0.961336 2.314450 10.469840
15 compactness_se 0.002252 0.025478 0.020450 0.13540 0.018120 0.011040 3 3 1.000000 541 569 0.950791 1.902221 5.106252
16 concavity_se 0.000000 0.031894 0.025890 0.39600 0.000000 0.021850 13 2 6.500000 533 569 0.936731 5.110463 48.861395
17 concave points_se 0.000000 0.011796 0.010930 0.05279 0.000000 0.011670 13 3 4.333333 507 569 0.891037 1.444678 5.126302
18 symmetry_se 0.007882 0.020542 0.018730 0.07895 0.013440 0.020450 4 3 1.333333 498 569 0.875220 2.195133 7.896130
19 fractal_dimension_se 0.000895 0.003795 0.003187 0.02984 0.002256 0.002205 2 2 1.000000 545 569 0.957821 3.923969 26.280847
20 radius_worst 7.930000 16.269190 14.970000 36.04000 12.360000 13.500000 5 4 1.250000 457 569 0.803163 1.103115 0.944090
21 texture_worst 12.020000 25.677223 25.410000 49.54000 17.700000 27.260000 3 3 1.000000 511 569 0.898067 0.498321 0.224302
22 perimeter_worst 50.410000 107.261213 97.660000 251.20000 117.700000 105.900000 3 3 1.000000 514 569 0.903339 1.128164 1.070150
23 area_worst 185.200000 880.583128 686.500000 4254.00000 698.800000 808.900000 2 2 1.000000 544 569 0.956063 1.859373 4.396395
24 smoothness_worst 0.071170 0.132369 0.131300 0.22260 0.140100 0.131200 4 4 1.000000 411 569 0.722320 0.415426 0.517825
25 compactness_worst 0.027290 0.254265 0.211900 1.05800 0.148600 0.341600 3 3 1.000000 529 569 0.929701 1.473555 3.039288
26 concavity_worst 0.000000 0.272188 0.226700 1.25200 0.000000 0.450400 13 3 4.333333 539 569 0.947276 1.150237 1.615253
27 concave points_worst 0.000000 0.114606 0.099930 0.29100 0.000000 0.110500 13 3 4.333333 492 569 0.864675 0.492616 -0.535535
28 symmetry_worst 0.156500 0.290076 0.282200 0.66380 0.236900 0.310900 3 3 1.000000 500 569 0.878735 1.433928 4.444560
29 fractal_dimension_worst 0.055040 0.083946 0.080040 0.20750 0.074270 0.087010 3 2 1.500000 535 569 0.940246 1.662579 5.244611
In [42]:
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>10)])
Out[42]:
0
In [43]:
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
Out[43]:
0
In [44]:
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])
Out[44]:
5
In [45]:
##################################
# Identifying the numerical columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
Numeric.Column.Name Minimum Mean Median Maximum First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio Skewness Kurtosis
13 area_se 6.802000 40.337079 24.530000 542.20000 16.970000 16.640000 3 3 1.0 528 569 0.927944 5.447186 49.209077
16 concavity_se 0.000000 0.031894 0.025890 0.39600 0.000000 0.021850 13 2 6.5 533 569 0.936731 5.110463 48.861395
19 fractal_dimension_se 0.000895 0.003795 0.003187 0.02984 0.002256 0.002205 2 2 1.0 545 569 0.957821 3.923969 26.280847
12 perimeter_se 0.757000 2.866059 2.287000 21.98000 1.778000 1.143000 4 2 2.0 533 569 0.936731 3.443615 21.401905
10 radius_se 0.111500 0.405172 0.324200 2.87300 0.286000 0.220400 3 3 1.0 540 569 0.949033 3.088612 17.686726
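For readers who want to see what the `Skewness` column measures, the sketch below recomputes it by hand. It assumes that the default pandas `skew()` applies the bias-adjusted Fisher-Pearson estimator; the helper name `adjusted_skewness` and the sample values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def adjusted_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness (G1) of a 1-D sample."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)   # second central moment
    m3 = np.mean(deviations ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5             # unadjusted sample skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

# Hypothetical right-skewed sample, not taken from the dataset
sample = [1.0, 1.2, 1.3, 1.5, 1.7, 2.0, 2.4, 9.5]
manual = adjusted_skewness(sample)
from_pandas = pd.Series(sample).skew()
print(manual, from_pandas)
```

A single large value on the right is enough to push the statistic well above zero, which is the pattern seen in the `_se` variables flagged above.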
In [46]:
##################################
# Formulating the dataset
# with categorical columns only
##################################
breast_cancer_categorical = breast_cancer.select_dtypes(include=['category','object'])
In [47]:
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = breast_cancer_categorical.columns
In [48]:
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [breast_cancer[x].value_counts().index.tolist()[0] for x in breast_cancer_categorical]
In [49]:
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [breast_cancer[x].value_counts().index.tolist()[1] for x in breast_cancer_categorical]
In [50]:
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [breast_cancer_categorical[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in breast_cancer_categorical]
In [51]:
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [breast_cancer_categorical[x].isin([breast_cancer[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in breast_cancer_categorical]
In [52]:
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
In [53]:
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = breast_cancer_categorical.nunique(dropna=True)
In [54]:
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(breast_cancer_categorical)] * len(breast_cancer_categorical.columns))
In [55]:
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
In [56]:
##################################
# Generating a column quality summary for the categorical columns
##################################
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
                                                    categorical_first_mode_list,
                                                    categorical_second_mode_list,
                                                    categorical_first_mode_count_list,
                                                    categorical_second_mode_count_list,
                                                    categorical_first_second_mode_ratio_list,
                                                    categorical_unique_count_list,
                                                    categorical_row_count_list,
                                                    categorical_unique_count_ratio_list), 
                                        columns=['Categorical.Column.Name',
                                                 'First.Mode',
                                                 'Second.Mode',
                                                 'First.Mode.Count',
                                                 'Second.Mode.Count',
                                                 'First.Second.Mode.Ratio',
                                                 'Unique.Count',
                                                 'Row.Count',
                                                 'Unique.Count.Ratio'])
display(categorical_column_quality_summary)
Categorical.Column.Name First.Mode Second.Mode First.Mode.Count Second.Mode.Count First.Second.Mode.Ratio Unique.Count Row.Count Unique.Count.Ratio
0 diagnosis B M 357 212 1.683962 2 569 0.003515
In [57]:
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
Out[57]:
0
In [58]:
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
Out[58]:
0
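The First.Second.Mode.Ratio used in the summaries above is the count of the most frequent value divided by the count of the second most frequent value. A minimal sketch follows, using a hypothetical helper `first_second_mode_ratio` and a series rebuilt from the reported `diagnosis` counts (357 B, 212 M) rather than the actual data file.

```python
import pandas as pd

def first_second_mode_ratio(series):
    """Ratio of the most frequent value's count to the runner-up's count."""
    counts = series.value_counts(dropna=True)  # sorted by descending count
    if len(counts) < 2:
        return float("nan")  # no second mode exists for a constant column
    return counts.iloc[0] / counts.iloc[1]

# Series rebuilt from the reported class counts, for illustration only
diagnosis = pd.Series(["B"] * 357 + ["M"] * 212)
ratio = first_second_mode_ratio(diagnosis)
print(round(ratio, 6))
```

The guard for columns with fewer than two distinct values is a design choice of this sketch; the list comprehensions above index position 1 directly and would raise an error on a constant column.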

1.4. Data Preprocessing ¶

1.4.1 Data Splitting¶

  1. The baseline dataset is comprised of:
    • 569 rows (observations)
      • 357 diagnosis=B: 62.74%
      • 212 diagnosis=M: 37.26%
    • 31 columns (variables)
      • 1/31 target (categorical)
        • diagnosis
      • 30/31 predictor (numeric)
        • radius_mean
        • texture_mean
        • perimeter_mean
        • area_mean
        • smoothness_mean
        • compactness_mean
        • concavity_mean
        • concave points_mean
        • symmetry_mean
        • fractal_dimension_mean
        • radius_se
        • texture_se
        • perimeter_se
        • area_se
        • smoothness_se
        • compactness_se
        • concavity_se
        • concave points_se
        • symmetry_se
        • fractal_dimension_se
        • radius_worst
        • texture_worst
        • perimeter_worst
        • area_worst
        • smoothness_worst
        • compactness_worst
        • concavity_worst
        • concave points_worst
        • symmetry_worst
        • fractal_dimension_worst
  2. The baseline dataset was divided into three subsets using a fixed random seed:
    • test data: 25% of the original data with class stratification applied
    • train data (initial): 75% of the original data with class stratification applied
      • train data (final): 75% of the train (initial) data with class stratification applied
      • validation data: 25% of the train (initial) data with class stratification applied
  3. Models were developed from the train data (final). Using the same dataset, a subset of models with optimal hyperparameters was selected based on cross-validation.
  4. Among candidate models with optimal hyperparameters, the final model was selected based on performance on the validation data.
  5. Performance of the selected final model (and of the other candidate models, for post-selection comparison) was evaluated using the test data.
  6. The train data (final) subset is comprised of:
    • 319 rows (observations)
      • 200 diagnosis=B: 62.70%
      • 119 diagnosis=M: 37.30%
    • 31 columns (variables)
  7. The validation data subset is comprised of:
    • 107 rows (observations)
      • 67 diagnosis=B: 62.62%
      • 40 diagnosis=M: 37.38%
    • 31 columns (variables)
  8. The test data subset is comprised of:
    • 143 rows (observations)
      • 90 diagnosis=B: 62.94%
      • 53 diagnosis=M: 37.06%
    • 31 columns (variables)
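The subset sizes listed above follow from applying the 75-25 ratio twice. The arithmetic can be sketched as below, assuming the held-out share is rounded up to a whole row and the remainder goes to the training side; this rounding rule is an assumption of the sketch, although it reproduces the reported row counts. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the held-out share up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

# First split: baseline data into train (initial) and test
n_train_initial, n_test = split_sizes(569, 0.25)

# Second split: train (initial) into train (final) and validation
n_train_final, n_validation = split_sizes(n_train_initial, 0.25)

print(n_train_initial, n_test)
print(n_train_final, n_validation)
```

In effect, the final training set holds roughly 56% of the original rows (319 of 569), the validation set about 19% (107) and the test set about 25% (143).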
In [59]:
##################################
# Creating a dataset copy
# of the original data
##################################
breast_cancer_baseline = breast_cancer.copy()
In [60]:
##################################
# Performing a general exploration
# of the baseline dataset
##################################
print('Final Dataset Dimensions: ')
display(breast_cancer_baseline.shape)
Final Dataset Dimensions: 
(569, 31)
In [61]:
##################################
# Obtaining the distribution
# of the target variable
##################################
print('Target Variable Breakdown: ')
breast_cancer_breakdown = breast_cancer_baseline.groupby('diagnosis', observed=True).size().reset_index(name='Count')
breast_cancer_breakdown['Percentage'] = (breast_cancer_breakdown['Count'] / len(breast_cancer_baseline)) * 100
display(breast_cancer_breakdown)
Target Variable Breakdown: 
diagnosis Count Percentage
0 B 357 62.741652
1 M 212 37.258348
In [62]:
##################################
# Formulating the train and test data
# from the final dataset
# by applying stratification and
# using a 75-25 ratio
##################################
breast_cancer_train_initial, breast_cancer_test = train_test_split(breast_cancer_baseline, 
                                                               test_size=0.25, 
                                                               stratify=breast_cancer_baseline['diagnosis'], 
                                                               random_state=987654321)
In [63]:
##################################
# Performing a general exploration
# of the initial training dataset
##################################
X_train_initial = breast_cancer_train_initial.drop('diagnosis', axis = 1)
y_train_initial = breast_cancer_train_initial['diagnosis']
print('Initial Train Dataset Dimensions: ')
display(X_train_initial.shape)
display(y_train_initial.shape)
print('Initial Train Target Variable Breakdown: ')
display(y_train_initial.value_counts())
print('Initial Train Target Variable Proportion: ')
display(y_train_initial.value_counts(normalize = True))
Initial Train Dataset Dimensions: 
(426, 30)
(426,)
Initial Train Target Variable Breakdown: 
diagnosis
B    267
M    159
Name: count, dtype: int64
Initial Train Target Variable Proportion: 
diagnosis
B    0.626761
M    0.373239
Name: proportion, dtype: float64
In [64]:
##################################
# Performing a general exploration
# of the test dataset
##################################
X_test = breast_cancer_test.drop('diagnosis', axis = 1)
y_test = breast_cancer_test['diagnosis']
print('Test Dataset Dimensions: ')
display(X_test.shape)
display(y_test.shape)
print('Test Target Variable Breakdown: ')
display(y_test.value_counts())
print('Test Target Variable Proportion: ')
display(y_test.value_counts(normalize = True))
Test Dataset Dimensions: 
(143, 30)
(143,)
Test Target Variable Breakdown: 
diagnosis
B    90
M    53
Name: count, dtype: int64
Test Target Variable Proportion: 
diagnosis
B    0.629371
M    0.370629
Name: proportion, dtype: float64
In [65]:
##################################
# Formulating the train and validation data
# from the train dataset
# by applying stratification and
# using a 75-25 ratio
##################################
breast_cancer_train, breast_cancer_validation = train_test_split(breast_cancer_train_initial, 
                                                             test_size=0.25, 
                                                             stratify=breast_cancer_train_initial['diagnosis'], 
                                                             random_state=987654321)
In [66]:
##################################
# Performing a general exploration
# of the final training dataset
##################################
X_train = breast_cancer_train.drop('diagnosis', axis = 1)
y_train = breast_cancer_train['diagnosis']
print('Final Train Dataset Dimensions: ')
display(X_train.shape)
display(y_train.shape)
print('Final Train Target Variable Breakdown: ')
display(y_train.value_counts())
print('Final Train Target Variable Proportion: ')
display(y_train.value_counts(normalize = True))
Final Train Dataset Dimensions: 
(319, 30)
(319,)
Final Train Target Variable Breakdown: 
diagnosis
B    200
M    119
Name: count, dtype: int64
Final Train Target Variable Proportion: 
diagnosis
B    0.626959
M    0.373041
Name: proportion, dtype: float64
In [67]:
##################################
# Performing a general exploration
# of the validation dataset
##################################
X_validation = breast_cancer_validation.drop('diagnosis', axis = 1)
y_validation = breast_cancer_validation['diagnosis']
print('Validation Dataset Dimensions: ')
display(X_validation.shape)
display(y_validation.shape)
print('Validation Target Variable Breakdown: ')
display(y_validation.value_counts())
print('Validation Target Variable Proportion: ')
display(y_validation.value_counts(normalize = True))
Validation Dataset Dimensions: 
(107, 30)
(107,)
Validation Target Variable Breakdown: 
diagnosis
B    67
M    40
Name: count, dtype: int64
Validation Target Variable Proportion: 
diagnosis
B    0.626168
M    0.373832
Name: proportion, dtype: float64
In [68]:
##################################
# Saving the training data
# to the DATASETS_FINAL_TRAIN_PATH
# and DATASETS_FINAL_TRAIN_FEATURES_PATH
# and DATASETS_FINAL_TRAIN_TARGET_PATH
##################################
breast_cancer_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_PATH, "breast_cancer_train.csv"), index=False)
X_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_FEATURES_PATH, "X_train.csv"), index=False)
y_train.to_csv(os.path.join("..", DATASETS_FINAL_TRAIN_TARGET_PATH, "y_train.csv"), index=False)
In [69]:
##################################
# Saving the validation data
# to the DATASETS_FINAL_VALIDATION_PATH
# and DATASETS_FINAL_VALIDATION_FEATURES_PATH
# and DATASETS_FINAL_VALIDATION_TARGET_PATH
##################################
breast_cancer_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_PATH, "breast_cancer_validation.csv"), index=False)
X_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_FEATURES_PATH, "X_validation.csv"), index=False)
y_validation.to_csv(os.path.join("..", DATASETS_FINAL_VALIDATION_TARGET_PATH, "y_validation.csv"), index=False)
In [70]:
##################################
# Saving the test data
# to the DATASETS_FINAL_TEST_PATH
# and DATASETS_FINAL_TEST_FEATURES_PATH
# and DATASETS_FINAL_TEST_TARGET_PATH
##################################
breast_cancer_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_PATH, "breast_cancer_test.csv"), index=False)
X_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_FEATURES_PATH, "X_test.csv"), index=False)
y_test.to_csv(os.path.join("..", DATASETS_FINAL_TEST_TARGET_PATH, "y_test.csv"), index=False)

1.4.2 Outlier and Distributional Shape Analysis¶

Outlier and distributional shape analysis findings based on assessment of the training data are as follows:

  1. High skewness observed for 5 variables with Skewness>3 or Skewness<(-3).
    • area_se: Skewness = 6.562
    • concavity_se: Skewness = 5.649
    • fractal_dimension_se: Skewness = 4.281
    • perimeter_se: Skewness = 4.136
    • radius_se: Skewness = 3.775
  2. Relatively high number of outliers observed for 7 numeric variables with Outlier.Ratio>0.05.
    • area_se: Outlier.Ratio = 0.110
    • radius_se: Outlier.Ratio = 0.075
    • perimeter_se: Outlier.Ratio = 0.075
    • smoothness_se: Outlier.Ratio = 0.060
    • compactness_se: Outlier.Ratio = 0.060
    • fractal_dimension_se: Outlier.Ratio = 0.056
    • symmetry_se: Outlier.Ratio = 0.050
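The Outlier.Ratio values above come from the interquartile range (IQR) criterion: a value counts as an outlier when it falls more than 1.5 IQR below the first quartile or above the third quartile. The following is a minimal sketch on toy data; the helper name `iqr_outlier_ratio` and the columns `steady` and `spiky` are hypothetical, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_ratio(frame, whisker=1.5):
    """Share of values per numeric column outside the IQR fences."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values below the lower fence or above the upper fence
    is_outlier = (numeric < q1 - whisker * iqr) | (numeric > q3 + whisker * iqr)
    return is_outlier.mean()

# Toy data: one evenly spread column and one with a single extreme value
toy = pd.DataFrame({
    "steady": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "spiky": [1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 50.0],
})
ratios = iqr_outlier_ratio(toy)
print(ratios)
```

Taking the mean of the boolean mask yields the ratio directly, so the separate outlier count and row count lists are only needed when both figures are reported side by side, as in the summary table.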
In [71]:
##################################
# Formulating the training dataset
# with numeric columns only
##################################
breast_cancer_train_numeric = breast_cancer_train.select_dtypes(include='number')
In [72]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(breast_cancer_train_numeric.columns)
In [73]:
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = breast_cancer_train_numeric.skew()
In [74]:
##################################
# Computing the interquartile range
# for all columns
##################################
breast_cancer_train_numeric_q1 = breast_cancer_train_numeric.quantile(0.25)
breast_cancer_train_numeric_q3 = breast_cancer_train_numeric.quantile(0.75)
breast_cancer_train_numeric_iqr = breast_cancer_train_numeric_q3 - breast_cancer_train_numeric_q1
In [75]:
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((breast_cancer_train_numeric < (breast_cancer_train_numeric_q1 - 1.5 * breast_cancer_train_numeric_iqr)) | (breast_cancer_train_numeric > (breast_cancer_train_numeric_q3 + 1.5 * breast_cancer_train_numeric_iqr))).sum()
In [76]:
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(breast_cancer_train_numeric)] * len(breast_cancer_train_numeric.columns))
In [77]:
##################################
# Gathering the outlier to count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
In [78]:
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
                                                  numeric_skewness_list,
                                                  numeric_outlier_count_list,
                                                  numeric_row_count_list,
                                                  numeric_outlier_ratio_list), 
                                        columns=['Numeric.Column.Name',
                                                 'Skewness',
                                                 'Outlier.Count',
                                                 'Row.Count',
                                                 'Outlier.Ratio'])
display(numeric_column_outlier_summary)
Numeric.Column.Name Skewness Outlier.Count Row.Count Outlier.Ratio
0 radius_mean 0.966211 6 319 0.018809
1 texture_mean 0.746964 4 319 0.012539
2 perimeter_mean 1.034320 6 319 0.018809
3 area_mean 1.819687 9 319 0.028213
4 smoothness_mean 0.166009 1 319 0.003135
5 compactness_mean 1.115958 6 319 0.018809
6 concavity_mean 1.412274 10 319 0.031348
7 concave points_mean 1.155582 11 319 0.034483
8 symmetry_mean 0.532891 7 319 0.021944
9 fractal_dimension_mean 1.054941 8 319 0.025078
10 radius_se 3.775498 24 319 0.075235
11 texture_se 1.464707 10 319 0.031348
12 perimeter_se 4.136225 24 319 0.075235
13 area_se 6.562034 35 319 0.109718
14 smoothness_se 1.313172 19 319 0.059561
15 compactness_se 1.701432 19 319 0.059561
16 concavity_se 5.648674 14 319 0.043887
17 concave points_se 1.592173 14 319 0.043887
18 symmetry_se 2.442436 16 319 0.050157
19 fractal_dimension_se 4.280973 18 319 0.056426
20 radius_worst 1.016127 3 319 0.009404
21 texture_worst 0.476084 2 319 0.006270
22 perimeter_worst 1.075965 5 319 0.015674
23 area_worst 1.892646 13 319 0.040752
24 smoothness_worst 0.237077 0 319 0.000000
25 compactness_worst 1.098476 6 319 0.018809
26 concavity_worst 1.067913 5 319 0.015674
27 concave points_worst 0.436446 0 319 0.000000
28 symmetry_worst 1.154060 10 319 0.031348
29 fractal_dimension_worst 1.001579 10 319 0.031348
In [79]:
##################################
# Identifying the numerical columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
display(numeric_column_outlier_summary[(numeric_column_outlier_summary['Skewness']>3) | (numeric_column_outlier_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
Numeric.Column.Name Skewness Outlier.Count Row.Count Outlier.Ratio
13 area_se 6.562034 35 319 0.109718
16 concavity_se 5.648674 14 319 0.043887
19 fractal_dimension_se 4.280973 18 319 0.056426
12 perimeter_se 4.136225 24 319 0.075235
10 radius_se 3.775498 24 319 0.075235
In [80]:
##################################
# Identifying the numerical columns
# with Outlier.Ratio > 0.05
##################################
display(numeric_column_outlier_summary[numeric_column_outlier_summary['Outlier.Ratio']>0.05].sort_values(by=['Outlier.Ratio'], ascending=False))
Numeric.Column.Name Skewness Outlier.Count Row.Count Outlier.Ratio
13 area_se 6.562034 35 319 0.109718
10 radius_se 3.775498 24 319 0.075235
12 perimeter_se 4.136225 24 319 0.075235
14 smoothness_se 1.313172 19 319 0.059561
15 compactness_se 1.701432 19 319 0.059561
19 fractal_dimension_se 4.280973 18 319 0.056426
18 symmetry_se 2.442436 16 319 0.050157
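The outlier counts and ratios above follow the 1.5 × IQR rule applied column by column. As a minimal sketch of that rule in one reusable function (the `summarize_outliers` helper and the toy data are hypothetical and are not part of the original notebook), the same summary columns can be assembled directly from pandas operations:

```python
import pandas as pd

def summarize_outliers(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize skewness and 1.5*IQR outlier ratios for numeric columns."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    outlier_mask = (frame < (q1 - 1.5 * iqr)) | (frame > (q3 + 1.5 * iqr))
    outlier_count = outlier_mask.sum()
    summary = pd.DataFrame({
        'Skewness': frame.skew(),
        'Outlier.Count': outlier_count,
        'Row.Count': len(frame),
        'Outlier.Ratio': outlier_count / len(frame),
    })
    return summary.rename_axis('Numeric.Column.Name').reset_index()

# Hypothetical toy data (not the breast cancer dataset)
toy = pd.DataFrame({
    'symmetric': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    'right_skewed': [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 100.0],
})
toy_summary = summarize_outliers(toy)
print(toy_summary)
```

In this toy example only the single extreme value in `right_skewed` is flagged, which also drives that column's positive skewness.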
In [81]:
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in breast_cancer_train_numeric:
        plt.figure(figsize=(17,1))
        sns.boxplot(data=breast_cancer_train_numeric, x=column)
        plt.show()
        plt.close()
No description has been provided for these images (30 boxplots, one per numeric column)

1.4.3 Collinearity¶

Collinearity evaluation findings based on assessment of the training data are as follows:

  1. Predictors were predominantly positively correlated, with the middle 50% of the pairwise correlation values ranging from 0.129 (Q1) to 0.558 (Q3).
  2. High Pearson.Correlation values > 0.90 were noted for 4.83% (21/435) of the pairwise combinations of predictors:
    • radius_mean and perimeter_mean: Pearson.Correlation = 0.997
    • radius_worst and perimeter_worst: Pearson.Correlation = 0.993
    • perimeter_mean and area_mean: Pearson.Correlation = 0.985
    • radius_mean and area_mean: Pearson.Correlation = 0.984
    • radius_worst and area_worst: Pearson.Correlation = 0.982
    • perimeter_worst and area_worst: Pearson.Correlation = 0.978
    • perimeter_mean and perimeter_worst: Pearson.Correlation = 0.972
    • perimeter_mean and radius_worst: Pearson.Correlation = 0.972
    • radius_mean and radius_worst: Pearson.Correlation = 0.971
    • radius_se and perimeter_se: Pearson.Correlation = 0.971
    • radius_mean and perimeter_worst: Pearson.Correlation = 0.967
    • area_mean and area_worst: Pearson.Correlation = 0.964
    • area_mean and radius_worst: Pearson.Correlation = 0.958
    • area_mean and perimeter_worst: Pearson.Correlation = 0.955
    • perimeter_mean and area_worst: Pearson.Correlation = 0.951
    • radius_se and area_se: Pearson.Correlation = 0.948
    • radius_mean and area_worst: Pearson.Correlation = 0.948
    • perimeter_se and area_se: Pearson.Correlation = 0.942
    • texture_mean and texture_worst: Pearson.Correlation = 0.923
    • concave points_mean and concave points_worst: Pearson.Correlation = 0.911
    • concavity_mean and concave points_mean: Pearson.Correlation = 0.900
In [82]:
##################################
# Creating a dataset copy
# with only the predictors present
# for correlation analysis
##################################
breast_cancer_train_correlation = breast_cancer_train.drop(['diagnosis'], axis=1)
display(breast_cancer_train_correlation)
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
id
868826 14.950 17.57 96.85 678.1 0.11670 0.13050 0.15390 0.08624 0.1957 0.06216 ... 18.55 21.43 121.40 971.4 0.1411 0.21640 0.33550 0.16670 0.3414 0.07147
8810703 28.110 18.47 188.50 2499.0 0.11420 0.15160 0.32010 0.15950 0.1648 0.05525 ... 28.11 18.47 188.50 2499.0 0.1142 0.15160 0.32010 0.15950 0.1648 0.05525
906878 13.660 19.13 89.46 575.3 0.09057 0.11470 0.09657 0.04812 0.1848 0.06181 ... 15.14 25.50 101.40 708.8 0.1147 0.31670 0.36600 0.14070 0.2744 0.08839
911654 14.200 20.53 92.41 618.4 0.08931 0.11080 0.05063 0.03058 0.1506 0.06009 ... 16.45 27.26 112.10 828.5 0.1153 0.34290 0.25120 0.13390 0.2534 0.07858
903483 8.734 16.84 55.27 234.3 0.10390 0.07428 0.00000 0.00000 0.1985 0.07098 ... 10.17 22.80 64.01 317.0 0.1460 0.13100 0.00000 0.00000 0.2445 0.08865
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84862001 16.130 20.68 108.10 798.8 0.11700 0.20220 0.17220 0.10280 0.2164 0.07356 ... 20.96 31.48 136.80 1315.0 0.1789 0.42330 0.47840 0.20730 0.3706 0.11420
90317302 10.260 12.22 65.75 321.6 0.09996 0.07542 0.01923 0.01968 0.1800 0.06569 ... 11.38 15.65 73.23 394.5 0.1343 0.16500 0.08615 0.06696 0.2937 0.07722
86211 12.180 17.84 77.79 451.1 0.10450 0.07057 0.02490 0.02941 0.1900 0.06635 ... 12.83 20.92 82.14 495.2 0.1140 0.09358 0.04980 0.05882 0.2227 0.07376
926954 16.600 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 0.05648 ... 18.98 34.12 126.70 1124.0 0.1139 0.30940 0.34030 0.14180 0.2218 0.07820
86208 20.260 23.03 132.40 1264.0 0.09078 0.13130 0.14650 0.08683 0.2095 0.05649 ... 24.22 31.59 156.10 1750.0 0.1190 0.35390 0.40980 0.15730 0.3689 0.08368

319 rows × 30 columns

In [83]:
##################################
# Initializing the correlation matrix
##################################
breast_cancer_train_correlation_matrix = pd.DataFrame(np.zeros((len(breast_cancer_train_correlation.columns), len(breast_cancer_train_correlation.columns))),
                                                       columns=breast_cancer_train_correlation.columns,
                                                       index=breast_cancer_train_correlation.columns)
In [84]:
##################################
# Calculating different types
# of correlation coefficients
# per variable type
##################################
for i in range(len(breast_cancer_train_correlation.columns)):
    for j in range(i, len(breast_cancer_train_correlation.columns)):
        if i == j:
            breast_cancer_train_correlation_matrix.iloc[i, j] = 1.0  
        else:
            col_i = breast_cancer_train_correlation.iloc[:, i]
            col_j = breast_cancer_train_correlation.iloc[:, j]

            # Detecting binary variables (assumes binary variables are coded as 0/1)
            is_binary_i = col_i.nunique() == 2
            is_binary_j = col_j.nunique() == 2

            # Computing the Pearson correlation for two continuous variables
            if col_i.dtype in ['int64', 'float64'] and col_j.dtype in ['int64', 'float64']:
                corr = col_i.corr(col_j)

            # Computing the Point-Biserial correlation for continuous and binary variables
            elif (col_i.dtype in ['int64', 'float64'] and is_binary_j) or (col_j.dtype in ['int64', 'float64'] and is_binary_i):
                continuous_var = col_i if col_i.dtype in ['int64', 'float64'] else col_j
                binary_var = col_j if is_binary_j else col_i

                # Convert binary variable to 0/1 (if not already)
                binary_var = binary_var.astype('category').cat.codes
                corr, _ = pointbiserialr(continuous_var, binary_var)

            # Computing the Phi coefficient for two binary variables
            elif is_binary_i and is_binary_j:
                corr = col_i.corr(col_j) 

            # Computing the Cramér's V for two categorical variables (if more than 2 categories)
            else:
                contingency_table = pd.crosstab(col_i, col_j)
                chi2, _, _, _ = chi2_contingency(contingency_table)
                n = contingency_table.sum().sum()
                phi2 = chi2 / n
                r, k = contingency_table.shape
                corr = np.sqrt(phi2 / min(k - 1, r - 1))  # Cramér's V formula

            # Assigning correlation values to the matrix
            breast_cancer_train_correlation_matrix.iloc[i, j] = corr
            breast_cancer_train_correlation_matrix.iloc[j, i] = corr

# Displaying the correlation matrix
display(breast_cancer_train_correlation_matrix)
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
radius_mean 1.000000 0.296754 0.997917 0.984916 0.145069 0.521699 0.653960 0.829568 0.178392 -0.362551 ... 0.971938 0.289640 0.967088 0.948294 0.072870 0.428210 0.496566 0.737347 0.217375 -0.038225
texture_mean 0.296754 1.000000 0.300384 0.293404 -0.081253 0.194167 0.243735 0.250668 0.068573 -0.098025 ... 0.319664 0.923246 0.322248 0.312728 0.004546 0.214524 0.207238 0.222253 0.068177 0.054058
perimeter_mean 0.997917 0.300384 1.000000 0.985186 0.180173 0.570667 0.691792 0.855888 0.209569 -0.313834 ... 0.972461 0.293613 0.972875 0.951121 0.103165 0.468946 0.531674 0.762926 0.235097 0.006081
area_mean 0.984916 0.293404 0.985186 1.000000 0.155662 0.519067 0.673052 0.829811 0.177404 -0.320518 ... 0.958283 0.275844 0.955452 0.964157 0.080662 0.406863 0.489149 0.712358 0.188049 -0.033382
smoothness_mean 0.145069 -0.081253 0.180173 0.155662 1.000000 0.618184 0.497254 0.538361 0.530327 0.518288 ... 0.197350 -0.022554 0.220016 0.198870 0.792618 0.414057 0.397902 0.487924 0.337613 0.444924
compactness_mean 0.521699 0.194167 0.570667 0.519067 0.618184 1.000000 0.878930 0.834485 0.586311 0.503176 ... 0.558509 0.206030 0.612310 0.546968 0.515692 0.862123 0.797969 0.821965 0.453133 0.659234
concavity_mean 0.653960 0.243735 0.691792 0.673052 0.497254 0.878930 1.000000 0.900196 0.510785 0.329362 ... 0.675255 0.253589 0.714621 0.683360 0.417894 0.751441 0.886489 0.851058 0.397756 0.514930
concave points_mean 0.829568 0.250668 0.855888 0.829811 0.538361 0.834485 0.900196 1.000000 0.454541 0.108612 ... 0.846434 0.268006 0.868059 0.835507 0.422292 0.668006 0.723963 0.911806 0.374473 0.331439
symmetry_mean 0.178392 0.068573 0.209569 0.177404 0.530327 0.586311 0.510785 0.454541 1.000000 0.419840 ... 0.220408 0.083343 0.247568 0.214373 0.412629 0.476576 0.453587 0.439723 0.689259 0.420701
fractal_dimension_mean -0.362551 -0.098025 -0.313834 -0.320518 0.518288 0.503176 0.329362 0.108612 0.419840 1.000000 ... -0.307089 -0.093992 -0.258932 -0.270836 0.447918 0.394331 0.327619 0.123832 0.196195 0.759504
radius_se 0.658347 0.229739 0.669725 0.719571 0.280807 0.473529 0.591306 0.661259 0.267726 -0.028843 ... 0.679356 0.147109 0.678844 0.727147 0.080241 0.249196 0.326790 0.483566 0.057317 0.006544
texture_se -0.063347 0.436088 -0.053460 -0.035532 0.067398 0.045054 0.074459 0.026147 0.063259 0.175273 ... -0.083551 0.452090 -0.077713 -0.060645 -0.103064 -0.130696 -0.092730 -0.120547 -0.195414 -0.054273
perimeter_se 0.663993 0.235670 0.681743 0.726247 0.277255 0.528764 0.621664 0.679461 0.276983 0.009488 ... 0.669812 0.153531 0.688770 0.718316 0.069734 0.309735 0.366793 0.510473 0.065297 0.048577
area_se 0.696051 0.210780 0.705619 0.774018 0.219236 0.429172 0.573166 0.643176 0.197534 -0.112437 ... 0.699869 0.139774 0.702817 0.771195 0.061473 0.249041 0.328935 0.479149 0.041415 -0.023982
smoothness_se -0.282663 -0.026715 -0.262615 -0.208247 0.309776 0.093602 0.069748 -0.028748 0.132195 0.446319 ... -0.281678 -0.113239 -0.270776 -0.216902 0.305564 -0.101057 -0.080945 -0.155433 -0.184909 0.113992
compactness_se 0.161000 0.116722 0.204162 0.180221 0.233059 0.706181 0.646702 0.440636 0.377026 0.595560 ... 0.166451 0.053456 0.222276 0.182170 0.131374 0.632269 0.608224 0.448014 0.163800 0.599360
concavity_se 0.101351 0.044926 0.132131 0.129111 0.203394 0.508586 0.664396 0.354830 0.340789 0.506201 ... 0.107206 0.007025 0.142656 0.128633 0.123140 0.429350 0.652703 0.382299 0.170189 0.461893
concave points_se 0.338116 0.079973 0.368434 0.339009 0.362823 0.650154 0.690708 0.591120 0.374946 0.383054 ... 0.330233 0.012760 0.363817 0.327707 0.160674 0.451405 0.556709 0.590946 0.107170 0.332135
symmetry_se -0.020080 0.031838 -0.001422 0.014059 0.160089 0.209686 0.200530 0.120330 0.373312 0.267390 ... -0.055671 -0.059537 -0.041901 -0.037870 -0.070226 -0.020577 0.024522 -0.022468 0.320748 -0.020828
fractal_dimension_se -0.086706 -0.004000 -0.051803 -0.054896 0.200008 0.457416 0.433186 0.204754 0.284368 0.698610 ... -0.077667 -0.064249 -0.042828 -0.050680 0.086398 0.336647 0.354796 0.174112 0.015405 0.582141
radius_worst 0.971938 0.319664 0.972461 0.958283 0.197350 0.558509 0.675255 0.846434 0.220408 -0.307089 ... 1.000000 0.341791 0.993610 0.982412 0.175453 0.494388 0.550967 0.788192 0.294281 0.050938
texture_worst 0.289640 0.923246 0.293613 0.275844 -0.022554 0.206030 0.253589 0.268006 0.083343 -0.093992 ... 0.341791 1.000000 0.345039 0.323485 0.145721 0.290799 0.277103 0.299552 0.189918 0.139916
perimeter_worst 0.967088 0.322248 0.972875 0.955452 0.220016 0.612310 0.714621 0.868059 0.247568 -0.258932 ... 0.993610 0.345039 1.000000 0.978668 0.196497 0.553308 0.597206 0.816546 0.310463 0.104998
area_worst 0.948294 0.312728 0.951121 0.964157 0.198870 0.546968 0.683360 0.835507 0.214373 -0.270836 ... 0.982412 0.323485 0.978668 1.000000 0.174507 0.467797 0.537041 0.755701 0.258457 0.050037
smoothness_worst 0.072870 0.004546 0.103165 0.080662 0.792618 0.515692 0.417894 0.422292 0.412629 0.447918 ... 0.175453 0.145721 0.196497 0.174507 1.000000 0.513382 0.478523 0.506041 0.446709 0.579201
compactness_worst 0.428210 0.214524 0.468946 0.406863 0.414057 0.862123 0.751441 0.668006 0.476576 0.394331 ... 0.494388 0.290799 0.553308 0.467797 0.513382 1.000000 0.869064 0.805226 0.555227 0.782035
concavity_worst 0.496566 0.207238 0.531674 0.489149 0.397902 0.797969 0.886489 0.723963 0.453587 0.327619 ... 0.550967 0.277103 0.597206 0.537041 0.478523 0.869064 1.000000 0.834462 0.510184 0.666844
concave points_worst 0.737347 0.222253 0.762926 0.712358 0.487924 0.821965 0.851058 0.911806 0.439723 0.123832 ... 0.788192 0.299552 0.816546 0.755701 0.506041 0.805226 0.834462 1.000000 0.496234 0.478328
symmetry_worst 0.217375 0.068177 0.235097 0.188049 0.337613 0.453133 0.397756 0.374473 0.689259 0.196195 ... 0.294281 0.189918 0.310463 0.258457 0.446709 0.555227 0.510184 0.496234 1.000000 0.427291
fractal_dimension_worst -0.038225 0.054058 0.006081 -0.033382 0.444924 0.659234 0.514930 0.331439 0.420701 0.759504 ... 0.050938 0.139916 0.104998 0.050037 0.579201 0.782035 0.666844 0.478328 0.427291 1.000000

30 rows × 30 columns
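Since all 30 predictors appear to be continuous floats, only the Pearson branch of the loop above is expected to run on this dataset; the other branches matter only for mixed-type data. The sketch below, using hypothetical toy series and a hypothetical `cramers_v` helper, illustrates two properties of those branches: the point-biserial coefficient matches the Pearson coefficient computed on 0/1 codes, and Cramér's V reaches 1 for a perfectly associated pair of three-level categorical variables.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pointbiserialr

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V computed from the chi-square statistic of a contingency table."""
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt((chi2 / n) / min(k - 1, r - 1)))

# Hypothetical toy data (not the breast cancer dataset)
rng = np.random.default_rng(42)
binary = pd.Series(rng.integers(0, 2, size=100))
continuous = pd.Series(rng.normal(size=100)) + 0.8 * binary

# Point-biserial correlation versus Pearson correlation on the 0/1 codes
pearson_r = continuous.corr(binary)
point_biserial_r, _ = pointbiserialr(binary, continuous)
print(f"Pearson: {pearson_r:.6f}, Point-biserial: {point_biserial_r:.6f}")

# Two three-level categorical variables that determine each other exactly
grade = pd.Series(['low', 'mid', 'high'] * 10)
stage = grade.map({'low': 'I', 'mid': 'II', 'high': 'III'})
v_perfect = cramers_v(grade, stage)
print(f"Cramér's V for a perfect association: {v_perfect:.3f}")
```

Note that `chi2_contingency` applies a continuity correction to 2 × 2 tables by default, so the same check on two binary variables would give a value slightly below 1.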

In [85]:
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
plt.figure(figsize=(25, 12))
sns.heatmap(breast_cancer_train_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
No description has been provided for this image
In [86]:
##################################
# Formulating the pairwise correlation summary
# between the predictor columns
##################################
# Converting the correlation matrix to a long format
breast_cancer_train_correlation_summary = (
    breast_cancer_train_correlation_matrix
    # keeping the upper triangle of the correlation matrix
    .where(~np.tril(np.ones(breast_cancer_train_correlation_matrix.shape)).astype(bool))  
    # converting to a long format
    .stack()  
    .reset_index()
)

# Renaming the summary columns
breast_cancer_train_correlation_summary.columns = ['Predictor1.Column.Name', 'Predictor2.Column.Name', 'Pearson.Correlation']

# Sorting from highest to lowest
breast_cancer_train_correlation_summary = breast_cancer_train_correlation_summary.sort_values(by='Pearson.Correlation', ascending=False).reset_index(drop=True)

# Displaying the summary table
display(breast_cancer_train_correlation_summary)
Predictor1.Column.Name Predictor2.Column.Name Pearson.Correlation
0 radius_mean perimeter_mean 0.997917
1 radius_worst perimeter_worst 0.993610
2 perimeter_mean area_mean 0.985186
3 radius_mean area_mean 0.984916
4 radius_worst area_worst 0.982412
... ... ... ...
430 radius_mean smoothness_se -0.282663
431 fractal_dimension_mean radius_worst -0.307089
432 perimeter_mean fractal_dimension_mean -0.313834
433 area_mean fractal_dimension_mean -0.320518
434 radius_mean fractal_dimension_mean -0.362551

435 rows × 3 columns

In [87]:
##################################
# Exploring the pairwise correlation values
# between the predictor columns
##################################
breast_cancer_train_correlation_exploration = (
    breast_cancer_train_correlation_matrix
    .where(~np.tril(np.ones(breast_cancer_train_correlation_matrix.shape)).astype(bool))
    .stack()
    .values
)

# Computing the quartiles and IQR
correlation_q1 = np.percentile(breast_cancer_train_correlation_exploration, 25)
correlation_q3 = np.percentile(breast_cancer_train_correlation_exploration, 75)
correlation_iqr = correlation_q3 - correlation_q1

print(f"Q1 (25th percentile): {correlation_q1:.3f}")
print(f"Q3 (75th percentile): {correlation_q3:.3f}")
print(f"IQR (Q3 - Q1): {correlation_iqr:.3f}")
Q1 (25th percentile): 0.129
Q3 (75th percentile): 0.558
IQR (Q3 - Q1): 0.429
In [88]:
##################################
# Determining the highly collinear predictors
# with absolute Pearson Correlation > 0.90
##################################
breast_cancer_train_correlation_summary_highcollinearity = breast_cancer_train_correlation_summary[breast_cancer_train_correlation_summary['Pearson.Correlation'].abs() > 0.90].reset_index(drop=True)
display(breast_cancer_train_correlation_summary_highcollinearity)
Predictor1.Column.Name Predictor2.Column.Name Pearson.Correlation
0 radius_mean perimeter_mean 0.997917
1 radius_worst perimeter_worst 0.993610
2 perimeter_mean area_mean 0.985186
3 radius_mean area_mean 0.984916
4 radius_worst area_worst 0.982412
5 perimeter_worst area_worst 0.978668
6 perimeter_mean perimeter_worst 0.972875
7 perimeter_mean radius_worst 0.972461
8 radius_mean radius_worst 0.971938
9 radius_se perimeter_se 0.971589
10 radius_mean perimeter_worst 0.967088
11 area_mean area_worst 0.964157
12 area_mean radius_worst 0.958283
13 area_mean perimeter_worst 0.955452
14 perimeter_mean area_worst 0.951121
15 radius_se area_se 0.948731
16 radius_mean area_worst 0.948294
17 perimeter_se area_se 0.942853
18 texture_mean texture_worst 0.923246
19 concave points_mean concave points_worst 0.911806
20 concavity_mean concave points_mean 0.900196
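
When every predictor is numeric, the same list of highly collinear pairs can also be obtained without an explicit double loop. The following is a minimal sketch under that assumption, using `DataFrame.corr` and a strict upper-triangle mask; the `high_correlation_pairs` helper and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """List each predictor pair whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(method='pearson')
    # The strict upper triangle keeps each pair once and drops the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().reset_index()
    pairs.columns = ['Predictor1.Column.Name', 'Predictor2.Column.Name', 'Pearson.Correlation']
    pairs = pairs[pairs['Pearson.Correlation'].abs() > threshold]
    return pairs.sort_values('Pearson.Correlation', ascending=False).reset_index(drop=True)

# Hypothetical toy data: perimeter is almost a linear function of radius
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.5, size=200)
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    'texture': rng.normal(19.0, 4.0, size=200),
})
toy_pairs = high_correlation_pairs(toy)
print(toy_pairs)
```

Filtering on the absolute value keeps strongly negative pairs as well, which mirrors the `.abs()` call used in the cell above.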

1.5. Data Exploration ¶

1.5.1 Exploratory Data Analysis¶

Exploratory data analysis findings are as follows:

  1. Based on visual inspection, bivariate analysis identified individual predictors with a generally positive association with the target variable.
  2. A total of 24 of the 30 predictors showed higher values for the diagnosis=M category than for the diagnosis=B category:
    • radius_mean
    • texture_mean
    • perimeter_mean
    • area_mean
    • compactness_mean
    • concavity_mean
    • concave points_mean
    • symmetry_mean
    • radius_se
    • perimeter_se
    • area_se
    • compactness_se
    • concave points_se
    • fractal_dimension_se
    • radius_worst
    • texture_worst
    • perimeter_worst
    • area_worst
    • smoothness_worst
    • compactness_worst
    • concavity_worst
    • concave points_worst
    • symmetry_worst
    • fractal_dimension_worst
In [89]:
##################################
# Segregating the target
# and predictor variables
##################################
breast_cancer_train_predictors_numeric = breast_cancer_train.iloc[:,1:].columns
In [90]:
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = breast_cancer_train_predictors_numeric
In [91]:
##################################
# Segregating the target variable
# and numeric predictors
##################################
boxplot_y_variable = 'diagnosis'
boxplot_x_variables = numeric_variable_name_list.values
In [92]:
##################################
# Defining the number of 
# rows and columns for the subplots
##################################
num_rows = 10
num_cols = 3
In [93]:
##################################
# Formulating the subplot structure
##################################
fig, axes = plt.subplots(num_rows, num_cols, figsize=(20, 40))

##################################
# Flattening the multi-row and
# multi-column axes
##################################
axes = axes.ravel()

##################################
# Formulating the individual boxplots
# for all numeric columns grouped by diagnosis
##################################
for i, x_variable in enumerate(boxplot_x_variables):
    ax = axes[i]
    ax.boxplot([group[x_variable] for name, group in breast_cancer_train.groupby(boxplot_y_variable, observed=True)])
    ax.set_title(f'{boxplot_y_variable} Versus {x_variable}')
    ax.set_xlabel(boxplot_y_variable)
    ax.set_ylabel(x_variable)
    ax.set_xticks(range(1, len(breast_cancer_train[boxplot_y_variable].unique()) + 1), ['B', 'M'])

##################################
# Adjusting the subplot layout
##################################
plt.tight_layout()

##################################
# Presenting the subplots
##################################
plt.show()
No description has been provided for this image
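
The visual comparison of the boxplots can be complemented with a simple numeric check. As a hedged sketch (the `compare_group_medians` helper and the toy values are hypothetical and do not reproduce the notebook's data), the per-class medians can be tabulated and compared directly:

```python
import pandas as pd

def compare_group_medians(frame: pd.DataFrame, target: str = 'diagnosis') -> pd.DataFrame:
    """Tabulate per-class medians and flag predictors that are higher for class M."""
    medians = frame.groupby(target, observed=True).median(numeric_only=True)
    # Transposing puts one predictor per row and one class per column
    comparison = medians.T
    comparison['Higher.In.M'] = comparison['M'] > comparison['B']
    return comparison

# Hypothetical toy data (not the breast cancer dataset)
toy = pd.DataFrame({
    'diagnosis': ['B', 'B', 'B', 'M', 'M', 'M'],
    'radius_mean': [11.0, 12.0, 13.0, 17.0, 18.0, 20.0],
    'smoothness_se': [0.008, 0.007, 0.009, 0.006, 0.007, 0.005],
})
toy_comparison = compare_group_medians(toy)
print(toy_comparison)
```

Medians are used here because they correspond to the center line drawn in each boxplot.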

1.5.2 Hypothesis Testing¶

  1. The relationship between the numeric predictors and the diagnosis target variable was statistically evaluated using the following hypotheses:
    • Null: Difference in the means between groups B and M is equal to zero
    • Alternative: Difference in the means between groups B and M is not equal to zero
  2. There is sufficient evidence to conclude that the means of the numeric measurements differ significantly between the B and M groups of the diagnosis target variable for 26 of the 30 numeric predictors, given their large absolute t-test statistics and p-values below the 0.05 significance level:
    • perimeter_worst: T.Test.Statistic=-23.391, T.Test.PValue=0.000
    • radius_worst: T.Test.Statistic=-23.228, T.Test.PValue=0.000
    • concave points_worst: T.Test.Statistic=-21.365, T.Test.PValue=0.000
    • concave points_mean: T.Test.Statistic=-21.258, T.Test.PValue=0.000
    • area_worst: T.Test.Statistic=-20.310, T.Test.PValue=0.000
    • perimeter_mean: T.Test.Statistic=-20.086, T.Test.PValue=0.000
    • radius_mean: T.Test.Statistic=-19.510, T.Test.PValue=0.000
    • area_mean: T.Test.Statistic=-17.991, T.Test.PValue=0.000
    • concavity_mean: T.Test.Statistic=-15.314, T.Test.PValue=0.000
    • concavity_worst: T.Test.Statistic=-13.368, T.Test.PValue=0.000
    • compactness_mean: T.Test.Statistic=-12.647, T.Test.PValue=0.000
    • compactness_worst: T.Test.Statistic=-12.079, T.Test.PValue=0.000
    • radius_se: T.Test.Statistic=-11.532, T.Test.PValue=0.000
    • perimeter_se: T.Test.Statistic=-11.234, T.Test.PValue=0.000
    • area_se: T.Test.Statistic=-10.375, T.Test.PValue=0.000
    • symmetry_worst: T.Test.Statistic=-8.312, T.Test.PValue=0.000
    • texture_worst: T.Test.Statistic=-7.911, T.Test.PValue=0.000
    • smoothness_worst: T.Test.Statistic=-7.080, T.Test.PValue=0.000
    • texture_mean: T.Test.Statistic=-6.682, T.Test.PValue=0.000
    • concave points_se: T.Test.Statistic=-6.679, T.Test.PValue=0.000
    • symmetry_mean: T.Test.Statistic=-6.315, T.Test.PValue=0.000
    • smoothness_mean: T.Test.Statistic=-6.087, T.Test.PValue=0.000
    • fractal_dimension_worst: T.Test.Statistic=-4.740, T.Test.PValue=0.000
    • compactness_se: T.Test.Statistic=-3.733, T.Test.PValue=0.000
    • concavity_se: T.Test.Statistic=-2.703, T.Test.PValue=0.007
    • smoothness_se: T.Test.Statistic=+2.425, T.Test.PValue=0.015
  3. Feature extraction using Principal Component Analysis was explored to address the high number of correlated predictors, several of which also showed high skewness and outlier ratios. The 30 predictors can potentially be reduced to just 10 uncorrelated principal components that retain about 95% of the original variance (cumulative explained variance of 0.953 at pc_10).
    • pc_1: Explained_Variance_Ratio=0.426, Cumulative_Explained_Variance=0.426
    • pc_2: Explained_Variance_Ratio=0.189, Cumulative_Explained_Variance=0.615
    • pc_3: Explained_Variance_Ratio=0.101, Cumulative_Explained_Variance=0.717
    • pc_4: Explained_Variance_Ratio=0.068, Cumulative_Explained_Variance=0.786
    • pc_5: Explained_Variance_Ratio=0.058, Cumulative_Explained_Variance=0.845
    • pc_6: Explained_Variance_Ratio=0.042, Cumulative_Explained_Variance=0.887
    • pc_7: Explained_Variance_Ratio=0.022, Cumulative_Explained_Variance=0.910
    • pc_8: Explained_Variance_Ratio=0.016, Cumulative_Explained_Variance=0.926
    • pc_9: Explained_Variance_Ratio=0.014, Cumulative_Explained_Variance=0.941
    • pc_10: Explained_Variance_Ratio=0.011, Cumulative_Explained_Variance=0.953
    • pc_11: Explained_Variance_Ratio=0.010, Cumulative_Explained_Variance=0.963
    • pc_12: Explained_Variance_Ratio=0.008, Cumulative_Explained_Variance=0.972
    • pc_13: Explained_Variance_Ratio=0.007, Cumulative_Explained_Variance=0.979
    • pc_14: Explained_Variance_Ratio=0.004, Cumulative_Explained_Variance=0.984
    • pc_15: Explained_Variance_Ratio=0.002, Cumulative_Explained_Variance=0.986
    • pc_16: Explained_Variance_Ratio=0.002, Cumulative_Explained_Variance=0.989
    • pc_17: Explained_Variance_Ratio=0.001, Cumulative_Explained_Variance=0.991
    • pc_18: Explained_Variance_Ratio=0.001, Cumulative_Explained_Variance=0.993
    • pc_19: Explained_Variance_Ratio=0.001, Cumulative_Explained_Variance=0.994
    • pc_20: Explained_Variance_Ratio=0.001, Cumulative_Explained_Variance=0.995
    • pc_21: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.996
    • pc_22: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.997
    • pc_23: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.998
    • pc_24: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_25: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_26: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_27: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_28: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_29: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=0.999
    • pc_30: Explained_Variance_Ratio=0.000, Cumulative_Explained_Variance=1.000
  4. There is sufficient evidence to conclude that the means of the principal component values differ significantly between the B and M groups of the diagnosis target variable for 6 of the 30 principal components, given their large absolute t-test statistics and p-values below the 0.05 significance level. The 30 predictors can potentially be reduced to at least 3 uncorrelated principal components demonstrating sufficient discrimination.
    • pc_1: T.Test.Statistic=-21.406, T.Test.PValue=0.000
    • pc_2: T.Test.Statistic=+4.080, T.Test.PValue=0.000
    • pc_3: T.Test.Statistic=+3.192, T.Test.PValue=0.015
    • pc_14: T.Test.Statistic=-2.299, T.Test.PValue=0.022
    • pc_17: T.Test.Statistic=+2.256, T.Test.PValue=0.024
    • pc_20: T.Test.Statistic=-2.001, T.Test.PValue=0.046
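The number of principal components needed to reach a variance target, as in item 3 above, can be read off the cumulative explained variance ratio. The sketch below shows one way to do this with scikit-learn on synthetic data; the `components_for_variance` helper and the simulated matrix are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def components_for_variance(X: np.ndarray, target: float = 0.95):
    """Return the smallest component count whose cumulative variance reaches the target."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(svd_solver='full').fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index with cumulative >= target; +1 converts it to a count
    n_components = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))
    return n_components, cumulative

# Hypothetical synthetic data: 10 observed columns driven by 3 latent factors
rng = np.random.default_rng(987654321)
latent = rng.normal(size=(300, 3))
weights = rng.normal(size=(3, 10))
X = latent @ weights + rng.normal(scale=0.01, size=(300, 10))

n_needed, cumulative_variance = components_for_variance(X, target=0.95)
print(f"Components needed for 95% variance: {n_needed}")
print(np.round(cumulative_variance, 3))
```

scikit-learn can also make this selection internally when `n_components` is passed as a fraction between 0 and 1 together with `svd_solver='full'`; computing the cumulative ratio explicitly keeps the full variance profile available for reporting.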
In [94]:
##################################
# Computing the t-test 
# statistic and p-values
# between the target variable
# and numeric predictor columns
##################################
breast_cancer_train_numeric_ttest_target = {}
breast_cancer_train_numeric = breast_cancer_train.iloc[:,1:]
breast_cancer_train_numeric_columns = breast_cancer_train_numeric.columns
# Splitting the rows by diagnosis group once, outside the loop
group_B = breast_cancer_train[breast_cancer_train.loc[:,'diagnosis']=='B']
group_M = breast_cancer_train[breast_cancer_train.loc[:,'diagnosis']=='M']
for numeric_column in breast_cancer_train_numeric_columns:
    breast_cancer_train_numeric_ttest_target['diagnosis_' + numeric_column] = stats.ttest_ind(
        group_B[numeric_column], 
        group_M[numeric_column], 
        equal_var=True)
In [95]:
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and numeric predictor columns
##################################
breast_cancer_train_numeric_hypothesistesting_summary = pd.DataFrame.from_dict(breast_cancer_train_numeric_ttest_target, orient='index')
breast_cancer_train_numeric_hypothesistesting_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(breast_cancer_train_numeric_hypothesistesting_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(30))
T.Test.Statistic T.Test.PValue
diagnosis_perimeter_worst -23.391423 5.216127e-71
diagnosis_radius_worst -23.228204 2.124527e-70
diagnosis_concave points_worst -21.365587 2.304689e-63
diagnosis_concave points_mean -21.258584 5.896498e-63
diagnosis_area_worst -20.310881 2.507249e-59
diagnosis_perimeter_mean -20.086310 1.830848e-58
diagnosis_radius_mean -19.510552 3.031653e-56
diagnosis_area_mean -17.991971 2.290509e-50
diagnosis_concavity_mean -15.314435 5.174576e-40
diagnosis_concavity_worst -13.368057 1.245191e-32
diagnosis_compactness_mean -12.647550 5.808618e-30
diagnosis_compactness_worst -12.079671 6.827871e-28
diagnosis_radius_se -11.532905 6.238111e-26
diagnosis_perimeter_se -11.234387 7.087958e-25
diagnosis_area_se -10.375886 6.586298e-22
diagnosis_symmetry_worst -8.312820 2.780206e-15
diagnosis_texture_worst -7.911132 4.296038e-14
diagnosis_smoothness_worst -7.080658 9.290923e-12
diagnosis_texture_mean -6.682817 1.055204e-10
diagnosis_concave points_se -6.679983 1.073250e-10
diagnosis_symmetry_mean -6.315327 9.103085e-10
diagnosis_smoothness_mean -6.087615 3.308230e-09
diagnosis_fractal_dimension_worst -4.740955 3.218718e-06
diagnosis_compactness_se -3.733659 2.236727e-04
diagnosis_concavity_se -2.703321 7.235270e-03
diagnosis_smoothness_se 2.425051 1.586462e-02
diagnosis_fractal_dimension_mean 1.513439 1.311644e-01
diagnosis_texture_se 0.432444 6.657128e-01
diagnosis_symmetry_se 0.155224 8.767432e-01
diagnosis_fractal_dimension_se -0.073082 9.417872e-01
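The `equal_var=True` setting used above corresponds to Student's t-test with a pooled variance estimate, and a negative statistic means the first group (B) has the lower mean. As a rough sketch on synthetic values (the group means, spreads and variable names below are assumptions chosen only for illustration, not taken from the dataset), the pooled statistic can be reproduced by hand and compared against Welch's variant, which drops the equal-variance assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one predictor in the benign and malignant groups
benign = rng.normal(loc=12.0, scale=2.0, size=200)
malignant = rng.normal(loc=17.0, scale=3.0, size=119)


def pooled_t_statistic(a, b):
    """Student's t statistic using a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


t_manual = pooled_t_statistic(benign, malignant)
t_pooled, p_pooled = stats.ttest_ind(benign, malignant, equal_var=True)
t_welch, p_welch = stats.ttest_ind(benign, malignant, equal_var=False)

print(f"Manual pooled t: {t_manual:.4f}")
print(f"SciPy pooled t:  {t_pooled:.4f} (p = {p_pooled:.3e})")
print(f"SciPy Welch t:   {t_welch:.4f} (p = {p_welch:.3e})")
```

When the two groups have clearly different spreads, comparing the pooled and Welch results is a quick way to check whether the equal-variance assumption changes the conclusion.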
In [96]:
##################################
# Exploring a feature extraction approach
# using Principal Component Analysis
# to address the high number of correlated predictors
# noted with high skewness and outlier ratio
##################################
# Standardizing predictors to address
# differences in scaling
##################################
scaler = StandardScaler()
breast_cancer_train_numeric_scaled = scaler.fit_transform(breast_cancer_train_numeric) 
breast_cancer_train_numeric_scaled = pd.DataFrame(breast_cancer_train_numeric_scaled,
                                                  columns=breast_cancer_train_numeric.columns,
                                                  index=breast_cancer_train_numeric.index)
In [97]:
##################################
# Conducting Principal Component Analysis
# on the standardized predictors
##################################
n_components = breast_cancer_train_numeric_scaled.shape[1]
pca = PCA(n_components=n_components, svd_solver='full', random_state=987654321)
breast_cancer_train_numeric_scaled_pcs = pca.fit_transform(breast_cancer_train_numeric_scaled)
In [98]:
##################################
# Consolidating the principal components
# into a dataframe and reattaching
# the diagnosis target column
##################################
pc_cols = [f'pc_{i+1}' for i in range(n_components)]
breast_cancer_train_numeric_scaled_pcs = pd.DataFrame(breast_cancer_train_numeric_scaled_pcs, columns=pc_cols, index=breast_cancer_train_numeric_scaled.index)
breast_cancer_train_pcs = pd.concat([breast_cancer_train[['diagnosis']].copy(), breast_cancer_train_numeric_scaled_pcs], axis=1)
In [99]:
##################################
# Consolidating the explained variance ratio
# for the principal components
##################################
explained_variance_ratio = pca.explained_variance_ratio_
explained_variance_ratio_summary = pd.DataFrame({
    'PC': pc_cols,
    'Explained_Variance_Ratio': explained_variance_ratio,
    'Cumulative_Explained_Variance': np.cumsum(explained_variance_ratio)
}).set_index('PC')
display(explained_variance_ratio_summary)
Explained_Variance_Ratio Cumulative_Explained_Variance
PC
pc_1 0.426228 0.426228
pc_2 0.189411 0.615639
pc_3 0.101749 0.717388
pc_4 0.068995 0.786383
pc_5 0.058895 0.845278
pc_6 0.042254 0.887533
pc_7 0.022768 0.910300
pc_8 0.016543 0.926843
pc_9 0.014899 0.941743
pc_10 0.011865 0.953608
pc_11 0.010183 0.963790
pc_12 0.008323 0.972114
pc_13 0.007802 0.979915
pc_14 0.004232 0.984147
pc_15 0.002850 0.986997
pc_16 0.002469 0.989465
pc_17 0.001967 0.991433
pc_18 0.001811 0.993243
pc_19 0.001471 0.994714
pc_20 0.001133 0.995847
pc_21 0.000952 0.996800
pc_22 0.000891 0.997691
pc_23 0.000713 0.998404
pc_24 0.000599 0.999002
pc_25 0.000480 0.999482
pc_26 0.000242 0.999724
pc_27 0.000203 0.999927
pc_28 0.000044 0.999972
pc_29 0.000024 0.999996
pc_30 0.000004 1.000000
In [100]:
##################################
# Computing the t-test 
# statistic and p-values
# between the target variable
# and principal component predictor columns
##################################
breast_cancer_train_pcs_ttest_target = {}
breast_cancer_train_pcs_numeric = breast_cancer_train_pcs.iloc[:,1:]
breast_cancer_train_pcs_numeric_columns = breast_cancer_train_pcs.iloc[:,1:].columns
for numeric_column in breast_cancer_train_pcs_numeric_columns:
    group_B = breast_cancer_train_pcs[breast_cancer_train_pcs.loc[:,'diagnosis']=='B']
    group_M = breast_cancer_train_pcs[breast_cancer_train_pcs.loc[:,'diagnosis']=='M']
    breast_cancer_train_pcs_ttest_target['diagnosis_' + numeric_column] = stats.ttest_ind(
        group_B[numeric_column], 
        group_M[numeric_column], 
        equal_var=True)
In [101]:
##################################
# Formulating the pairwise ttest summary
# between the target variable
# and principal component predictor columns
##################################
breast_cancer_train_pcs_numeric_hypothesistesting_summary = pd.DataFrame.from_dict(breast_cancer_train_pcs_ttest_target, orient='index')
breast_cancer_train_pcs_numeric_hypothesistesting_summary.columns = ['T.Test.Statistic', 'T.Test.PValue']
display(breast_cancer_train_pcs_numeric_hypothesistesting_summary.sort_values(by=['T.Test.PValue'], ascending=True).head(30))
T.Test.Statistic T.Test.PValue
diagnosis_pc_1 -21.406124 1.614914e-63
diagnosis_pc_2 4.080724 5.686808e-05
diagnosis_pc_3 3.192160 1.553738e-03
diagnosis_pc_13 -2.299656 2.211727e-02
diagnosis_pc_17 2.256550 2.471705e-02
diagnosis_pc_20 -2.001077 4.623628e-02
diagnosis_pc_4 -1.925622 5.504581e-02
diagnosis_pc_5 -1.762550 7.893997e-02
diagnosis_pc_14 -1.532393 1.264228e-01
diagnosis_pc_15 1.358293 1.753365e-01
diagnosis_pc_19 1.279015 2.018272e-01
diagnosis_pc_30 1.130236 2.592313e-01
diagnosis_pc_24 1.123771 2.619603e-01
diagnosis_pc_12 -1.010027 3.132526e-01
diagnosis_pc_25 -0.976871 3.293781e-01
diagnosis_pc_8 -0.911846 3.625425e-01
diagnosis_pc_26 0.838271 4.025101e-01
diagnosis_pc_16 -0.650378 5.159193e-01
diagnosis_pc_7 -0.576641 5.645909e-01
diagnosis_pc_23 0.503616 6.148809e-01
diagnosis_pc_29 -0.494381 6.213796e-01
diagnosis_pc_28 0.373462 7.090540e-01
diagnosis_pc_11 -0.362203 7.174415e-01
diagnosis_pc_9 0.261159 7.941393e-01
diagnosis_pc_10 -0.225840 8.214716e-01
diagnosis_pc_18 -0.221574 8.247879e-01
diagnosis_pc_22 -0.201455 8.404724e-01
diagnosis_pc_6 0.155608 8.764409e-01
diagnosis_pc_21 -0.138978 8.895559e-01
diagnosis_pc_27 -0.105049 9.164037e-01

1.6. Premodelling Data Preparation

1.6.1 Preprocessed Data Description

  1. Due to the considerable number of predictors noted with high skewness, outlier ratio and multicollinearity, standardization and PCA feature extraction were performed to address issues with distributional shape and pairwise correlation.
    • High skewness observed for 5 variables with Skewness>3 or Skewness<(-3).
      • area_se: Skewness = 6.562
      • concavity_se: Skewness = 5.648
      • fractal_dimension_se: Skewness = 4.280
      • perimeter_se: Skewness = 4.136
      • radius_se: Skewness = 3.775
    • Relatively high number of outliers observed for 7 numeric variables with Outlier.Ratio>0.05.
      • area_se: Outlier.Ratio = 0.110
      • radius_se: Outlier.Ratio = 0.075
      • perimeter_se: Outlier.Ratio = 0.075
      • smoothness_se: Outlier.Ratio = 0.059
      • compactness_se: Outlier.Ratio = 0.059
      • fractal_dimension_se: Outlier.Ratio = 0.056
      • symmetry_se: Outlier.Ratio = 0.050
    • High Pearson.Correlation values > 0.90 were noted for 4.60% (20/435) of the pairwise combinations of predictors:
      • radius_mean and perimeter_mean: Pearson.Correlation = 0.997
      • radius_worst and perimeter_worst: Pearson.Correlation = 0.993
      • perimeter_mean and area_mean: Pearson.Correlation = 0.985
      • radius_mean and area_mean: Pearson.Correlation = 0.984
      • radius_worst and area_worst: Pearson.Correlation = 0.982
      • perimeter_worst and area_worst: Pearson.Correlation = 0.978
      • perimeter_mean and perimeter_worst: Pearson.Correlation = 0.972
      • perimeter_mean and radius_worst: Pearson.Correlation = 0.972
      • radius_mean and radius_worst: Pearson.Correlation = 0.971
      • radius_se and perimeter_se: Pearson.Correlation = 0.971
      • radius_mean and perimeter_worst: Pearson.Correlation = 0.967
      • area_mean and area_worst: Pearson.Correlation = 0.964
      • area_mean and radius_worst: Pearson.Correlation = 0.958
      • area_mean and perimeter_worst: Pearson.Correlation = 0.955
      • perimeter_mean and area_worst: Pearson.Correlation = 0.951
      • radius_se and area_se: Pearson.Correlation = 0.948
      • radius_mean and area_worst: Pearson.Correlation = 0.948
      • perimeter_se and area_se: Pearson.Correlation = 0.942
      • texture_mean and texture_worst: Pearson.Correlation = 0.923
      • concave points_mean and concave points_worst: Pearson.Correlation = 0.911
      • concavity_mean and concave points_mean: Pearson.Correlation = 0.900
  2. Based on the assessment of cumulative explained variance and discrimination power of the extracted principal components, the number of predictors can range from 3 to 10.
  3. To enable diversity among predictors, 10 principal components were used for the downstream modeling process.
  4. The preprocessed train dataset (final) is comprised of:
    • 319 rows (observations)
      • 200 diagnosis=B: 62.70%
      • 119 diagnosis=M: 37.30%
    • 11 columns (variables)
      • 1/11 target (categorical)
        • diagnosis
      • 10/11 predictor (numeric)
        • pc_1
        • pc_2
        • pc_3
        • pc_4
        • pc_5
        • pc_6
        • pc_7
        • pc_8
        • pc_9
        • pc_10
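One possible way to read the "3 to 10" range in item 2 is as the number of components needed to reach a chosen cumulative explained variance. The sketch below reuses the cumulative values reported for the first 10 principal components; the 0.70 and 0.95 thresholds and the helper name `components_for` are assumptions for illustration, not criteria stated in this analysis:

```python
import numpy as np

# Cumulative explained variance of pc_1 to pc_10, copied from the summary table above
cumulative_explained_variance = np.array([
    0.426228, 0.615639, 0.717388, 0.786383, 0.845278,
    0.887533, 0.910300, 0.926843, 0.941743, 0.953608,
])


def components_for(threshold):
    """Smallest number of leading components whose cumulative variance reaches the threshold."""
    return int(np.searchsorted(cumulative_explained_variance, threshold) + 1)


# Hypothetical thresholds chosen only for illustration
print(components_for(0.70))  # 3
print(components_for(0.95))  # 10
```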

1.6.2 Preprocessing Pipeline Development

  1. A preprocessing pipeline was formulated and applied to the train data (final), validation data and test data with the following actions:
    • Applied standardization to address differences in scale among the predictors
    • Performed feature extraction using Principal Component Analysis on the scaled predictors
    • Retained the first 10 principal components as predictors
In [102]:
##################################
# Formulating a preprocessing pipeline
# that performs standardization,
# performs feature extraction using PCA, and
# filtering the first 10 principal components as predictors
##################################
def preprocess_dataset(train_df: pd.DataFrame, 
                       evaluation_df: pd.DataFrame, 
                       n_components: int = 10, 
                       random_state: int = 987654321) -> pd.DataFrame:  
    # Splitting the target and predictor columns
    target_col = train_df.columns[0]
    X_train = train_df.iloc[:, 1:]
    y_train = train_df.iloc[:, 0]
    X_test = evaluation_df.iloc[:, 1:]
    y_test = evaluation_df.iloc[:, 0]

    # Fitting StandardScaler on training data and transforming both training and evaluation data
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Fitting PCA on the scaled training data and transforming both training and evaluation data
    pca = PCA(n_components=min(n_components, X_train.shape[1]), random_state=random_state)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Preparing the output DataFrame for the evaluation data
    pc_cols = [f'pc_{i+1}' for i in range(X_test_pca.shape[1])]
    scaled_pcatransformed_evaluation_df = pd.DataFrame(X_test_pca, columns=pc_cols, index=evaluation_df.index)

    # Add target column back as first column
    scaled_pcatransformed_evaluation_df.insert(0, target_col, y_test.values)

    # Printing variance explained for reference
    explained_var = np.cumsum(pca.explained_variance_ratio_)
    print(f"Explained Variance (First {n_components} PCs): {explained_var[-1]:.4f}")

    return scaled_pcatransformed_evaluation_df
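The function above fits the scaler and PCA on `train_df` and applies them to `evaluation_df`. A minimal sketch of the same fit-on-train, transform-on-evaluation pattern using a scikit-learn `Pipeline` is shown below; the synthetic arrays and their names are assumptions standing in for the actual data splits:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data with 30 numeric predictors
train_X = rng.normal(size=(319, 30))
evaluation_X = rng.normal(size=(107, 30))

preprocessor = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10, random_state=987654321)),
])

# Scaling statistics and component loadings are learned from the training data only
preprocessor.fit(train_X)

# The same fitted transforms are then applied to both sets
train_pcs = preprocessor.transform(train_X)
evaluation_pcs = preprocessor.transform(evaluation_X)

print(train_pcs.shape, evaluation_pcs.shape)
```

Reusing one fitted transformer keeps the principal component axes identical across the train, validation and test sets, so a given `pc_k` column carries the same meaning in each of them.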
In [103]:
##################################
# Applying the preprocessing pipeline
# to the train data
##################################
breast_cancer_preprocessed_train = preprocess_dataset(breast_cancer_train, breast_cancer_train, 10, 987654321)
X_preprocessed_train = breast_cancer_preprocessed_train.drop('diagnosis', axis = 1)
y_preprocessed_train = breast_cancer_preprocessed_train['diagnosis']
breast_cancer_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_PATH, "breast_cancer_preprocessed_train.csv"), index=False)
X_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_FEATURES_PATH, "X_preprocessed_train.csv"), index=False)
y_preprocessed_train.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TRAIN_TARGET_PATH, "y_preprocessed_train.csv"), index=False)
print('Final Preprocessed Train Dataset Dimensions: ')
display(X_preprocessed_train.shape)
display(y_preprocessed_train.shape)
print('Final Preprocessed Train Target Variable Breakdown: ')
display(y_preprocessed_train.value_counts())
print('Final Preprocessed Train Target Variable Proportion: ')
display(y_preprocessed_train.value_counts(normalize = True))
breast_cancer_preprocessed_train.head()
Explained Variance (First 10 PCs): 0.9536
Final Preprocessed Train Dataset Dimensions: 
(319, 10)
(319,)
Final Preprocessed Train Target Variable Breakdown: 
diagnosis
B    200
M    119
Name: count, dtype: int64
Final Preprocessed Train Target Variable Proportion: 
diagnosis
B    0.626959
M    0.373041
Name: proportion, dtype: float64
Out[103]:
diagnosis pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 pc_7 pc_8 pc_9 pc_10
id
868826 M 3.729203 0.987215 3.540855 -2.064283 2.512443 1.936519 0.697969 0.871868 0.642028 -1.833888
8810703 M 12.079158 -6.698169 10.242397 -5.434204 3.701610 -1.501518 -4.413311 1.612258 1.425855 -1.835477
906878 B -0.311673 0.128320 -1.056912 0.070388 -1.547663 0.331599 0.032196 -0.533350 0.293836 0.071285
911654 B -0.474681 -0.957130 -0.280827 0.354585 -1.590079 -0.326743 -0.120392 -0.328281 -0.094953 -0.681747
903483 B -3.766843 2.522881 1.905036 -0.056397 2.901107 -1.592187 -1.428407 0.134134 -0.774598 1.244052
In [104]:
##################################
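# Applying the preprocessing pipeline
# to the validation data
# (the validation data is passed as both arguments,
# so the scaler and PCA are fitted on the validation data itself;
# passing breast_cancer_train as the first argument
# would reuse the train-fitted transforms instead)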
##################################
breast_cancer_preprocessed_validation = preprocess_dataset(breast_cancer_validation, breast_cancer_validation, 10, 987654321)
X_preprocessed_validation = breast_cancer_preprocessed_validation.drop('diagnosis', axis = 1)
y_preprocessed_validation = breast_cancer_preprocessed_validation['diagnosis']
breast_cancer_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_PATH, "breast_cancer_preprocessed_validation.csv"), index=False)
X_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_FEATURES_PATH, "X_preprocessed_validation.csv"), index=False)
y_preprocessed_validation.to_csv(os.path.join("..", DATASETS_PREPROCESSED_VALIDATION_TARGET_PATH, "y_preprocessed_validation.csv"), index=False)
print('Final Preprocessed Validation Dataset Dimensions: ')
display(X_preprocessed_validation.shape)
display(y_preprocessed_validation.shape)
print('Final Preprocessed Validation Target Variable Breakdown: ')
display(y_preprocessed_validation.value_counts())
print('Final Preprocessed Validation Target Variable Proportion: ')
display(y_preprocessed_validation.value_counts(normalize = True))
breast_cancer_preprocessed_validation.head()
Explained Variance (First 10 PCs): 0.9658
Final Preprocessed Validation Dataset Dimensions: 
(107, 10)
(107,)
Final Preprocessed Validation Target Variable Breakdown: 
diagnosis
B    67
M    40
Name: count, dtype: int64
Final Preprocessed Validation Target Variable Proportion: 
diagnosis
B    0.626168
M    0.373832
Name: proportion, dtype: float64
Out[104]:
diagnosis pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 pc_7 pc_8 pc_9 pc_10
id
86355 M 13.035175 0.217957 2.105837 -0.636468 0.051561 -1.807528 -0.025319 0.404616 -0.454300 -1.499024
884948 M 7.208194 -2.366385 1.928770 0.199315 -0.748175 -1.513559 0.265762 -0.764441 0.007565 0.402246
915276 B 1.300337 8.300252 -0.043626 -1.908086 -1.499821 3.112851 -0.717757 0.515100 1.345107 -0.311807
858970 B -2.512677 3.300052 1.674471 -2.190322 3.044812 -1.144982 0.227182 -0.581862 -0.484832 1.131556
898677 B -2.418011 4.124441 2.878352 -0.155380 -0.288107 0.993082 -0.246339 1.222199 2.012470 -0.674194
In [105]:
##################################
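# Applying the preprocessing pipeline
# to the test data
# (the test data is passed as both arguments,
# so the scaler and PCA are fitted on the test data itself;
# passing breast_cancer_train as the first argument
# would reuse the train-fitted transforms instead)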
##################################
breast_cancer_preprocessed_test = preprocess_dataset(breast_cancer_test, breast_cancer_test, 10, 987654321)
X_preprocessed_test = breast_cancer_preprocessed_test.drop('diagnosis', axis = 1)
y_preprocessed_test = breast_cancer_preprocessed_test['diagnosis']
breast_cancer_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_PATH, "breast_cancer_preprocessed_test.csv"), index=False)
X_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_FEATURES_PATH, "X_preprocessed_test.csv"), index=False)
y_preprocessed_test.to_csv(os.path.join("..", DATASETS_PREPROCESSED_TEST_TARGET_PATH, "y_preprocessed_test.csv"), index=False)
print('Final Preprocessed Test Dataset Dimensions: ')
display(X_preprocessed_test.shape)
display(y_preprocessed_test.shape)
print('Final Preprocessed Test Target Variable Breakdown: ')
display(y_preprocessed_test.value_counts())
print('Final Preprocessed Test Target Variable Proportion: ')
display(y_preprocessed_test.value_counts(normalize = True))
breast_cancer_preprocessed_test.head()
Explained Variance (First 10 PCs): 0.9630
Final Preprocessed Test Dataset Dimensions: 
(143, 10)
(143,)
Final Preprocessed Test Target Variable Breakdown: 
diagnosis
B    90
M    53
Name: count, dtype: int64
Final Preprocessed Test Target Variable Proportion: 
diagnosis
B    0.629371
M    0.370629
Name: proportion, dtype: float64
Out[105]:
diagnosis pc_1 pc_2 pc_3 pc_4 pc_5 pc_6 pc_7 pc_8 pc_9 pc_10
id
848406 M 0.203287 -1.498700 -0.973630 0.810168 0.458344 0.704048 0.268294 0.004397 0.546047 -0.413089
858981 B -2.363761 3.025143 1.519950 0.627623 2.306716 1.541578 -0.148369 -0.031751 -0.071823 -1.159295
88350402 B -2.316578 -1.273185 -0.261651 -1.193922 -0.203169 0.076551 0.687459 -0.161819 0.152953 -0.160444
9112594 B -3.134608 -1.944446 -0.040192 2.182643 0.277373 0.231880 0.295401 -0.048081 -0.121538 0.193050
86409 B 4.139336 3.702540 2.670982 -0.154971 -5.773728 -1.251681 -1.610567 1.354328 -0.115852 -0.220181
In [106]:
##################################
# Defining a function to compute
# model performance
##################################
def model_performance_evaluation(y_true, y_pred):
    metric_name = ['Accuracy','Precision','Recall','F1','AUROC']
    metric_value = [accuracy_score(y_true, y_pred),
                   precision_score(y_true, y_pred),
                   recall_score(y_true, y_pred),
                   f1_score(y_true, y_pred),
                   # AUROC is computed here from hard class labels,
                   # which makes it equal to the balanced accuracy
                   roc_auc_score(y_true, y_pred)]
    metric_summary = pd.DataFrame(zip(metric_name, metric_value),
                                  columns=['metric_name','metric_value']) 
    return(metric_summary)
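The evaluation function above passes hard class predictions to `roc_auc_score`. As a small illustration on made-up values (the labels, scores and 0.5 cutoff below are assumptions, not results from this analysis), AUROC computed from hard labels reduces to balanced accuracy, whereas predicted scores give a threshold-free value:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9])

# Hard predictions at an assumed 0.5 cutoff
y_pred = (y_score >= 0.5).astype(int)

auroc_from_labels = roc_auc_score(y_true, y_pred)
auroc_from_scores = roc_auc_score(y_true, y_score)
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)

print(auroc_from_labels)   # 0.75
print(balanced_accuracy)   # 0.75
print(auroc_from_scores)   # 0.9375
```

For a fitted classifier, the scores would typically come from `predict_proba(X)[:, 1]` rather than `predict(X)`.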

1.7. Model Development and Validation

1.7.1 Random Forest

Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to improve prediction accuracy and robustness in binary classification. Instead of relying on a single decision tree, it aggregates multiple trees, reducing overfitting and increasing generalizability. The algorithm works by training individual decision trees on bootstrapped samples of the dataset, where each tree is trained on a slightly different subset of data. Additionally, at each decision node, a random subset of features is considered for splitting, adding further diversity among the trees. The final classification is determined by majority voting across all trees. The main advantages of Random Forest include its resilience to overfitting, ability to handle high-dimensional data, and robustness against noisy data. However, it has limitations, such as higher computational cost due to multiple trees and reduced interpretability compared to a single decision tree. It can also struggle with highly imbalanced data unless additional techniques like class weighting are applied.
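The mechanics described above (bootstrapped samples, random feature subsets at each split, and majority voting) can be sketched by hand with individual decision trees. This is only an illustration on synthetic data; the dataset, the number of trees and the tree settings are assumptions, and `RandomForestClassifier` handles all of this internally:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
rng = np.random.default_rng(0)

n_trees = 25  # odd number, so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw rows with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_depth=5, max_features='sqrt', random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Each row holds one tree's predictions; the majority class wins
votes = np.stack([tree.predict(X) for tree in trees])
majority_vote = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority_vote == y).mean()

print(f"Training accuracy of the hand-built ensemble: {accuracy:.4f}")
```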

  1. The random forest model from the sklearn.ensemble Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • criterion = function to measure the quality of a split made to vary between gini and entropy
    • max_depth = maximum depth of the tree made to vary between 3 and 5
    • min_samples_leaf = minimum number of samples required to be at a leaf node made to vary between 5 and 10
    • max_features = number of features to consider when looking for the best split made to vary between 7 and 9
  3. A special hyperparameter (class_weight = balanced) was fixed to address the minimal 1.7:1 class imbalance observed between the B and M diagnosis categories.
  4. Hyperparameter tuning was conducted using 5-cycle 5-fold cross-validation, with the optimal F1 score obtained for:
    • criterion = entropy
    • max_depth = 5
    • min_samples_leaf = 5
    • max_features = 9
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9749
    • Precision = 0.9744
    • Recall = 0.9580
    • F1 Score = 0.9661
    • AUROC = 0.9715
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9346
    • Precision = 0.9714
    • Recall = 0.8500
    • F1 Score = 0.9067
    • AUROC = 0.9175
  7. The apparent and independent validation model performance were sufficiently comparable, which might indicate the absence of excessive model overfitting.
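The `class_weight = balanced` setting in item 3 weights each class inversely to its frequency. A small sketch, assuming the 200 B and 119 M train counts reported earlier and a 0/1 encoding chosen here for illustration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in target with the reported train class counts (0 = B, 1 = M)
y = np.array([0] * 200 + [1] * 119)

# 'balanced' weights follow n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
manual_weights = len(y) / (2 * np.bincount(y))

print(weights)         # approximately [0.7975, 1.3403]
print(manual_weights)
```

The minority class (M) receives the larger weight, so its misclassifications count more heavily when the trees are grown.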
In [107]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [108]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
bagged_rf_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('bagged_rf_model', RandomForestClassifier(
        class_weight='balanced',
        random_state=987654321))
])
In [109]:
##################################
# Defining hyperparameter grid
##################################
bagged_rf_hyperparameter_grid = {
    'bagged_rf_model__criterion': ['gini', 'entropy'],
    'bagged_rf_model__max_depth': [3, 5],
    'bagged_rf_model__min_samples_leaf': [5, 10],
    'bagged_rf_model__max_features': [7, 9]
}
In [110]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
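As a quick check of what "5-cycle 5-fold" implies, the sketch below counts the resamples produced by this strategy and the candidates in the hyperparameter grid defined above. The stand-in target reuses the reported train class counts; the placeholder feature matrix is an assumption, since only the target drives stratification:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

# Stand-in target with the reported train class counts (0 = B, 1 = M)
y = np.array([0] * 200 + [1] * 119)
X = np.zeros((len(y), 1))  # placeholder features

cv_strategy = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=987654321)
n_resamples = cv_strategy.get_n_splits(X, y)

# Share of the positive class in each held-out fold stays close to the overall share
positive_share = [y[test_index].mean() for _, test_index in cv_strategy.split(X, y)]

hyperparameter_grid = {
    'bagged_rf_model__criterion': ['gini', 'entropy'],
    'bagged_rf_model__max_depth': [3, 5],
    'bagged_rf_model__min_samples_leaf': [5, 10],
    'bagged_rf_model__max_features': [7, 9],
}
n_candidates = len(ParameterGrid(hyperparameter_grid))

print(n_resamples, n_candidates, n_resamples * n_candidates)  # 25 16 400
```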
In [111]:
##################################
# Performing Grid Search with cross-validation
##################################
bagged_rf_grid_search = GridSearchCV(
    estimator=bagged_rf_pipeline,
    param_grid=bagged_rf_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [112]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [113]:
##################################
# Fitting GridSearchCV
##################################
bagged_rf_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[113]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('bagged_rf_model',
                                        RandomForestClassifier(class_weight='balanced',
                                                               random_state=987654321))]),
             n_jobs=-1,
             param_grid={'bagged_rf_model__criterion': ['gini', 'entropy'],
                         'bagged_rf_model__max_depth': [3, 5],
                         'bagged_rf_model__max_features': [7, 9],
                         'bagged_rf_model__min_samples_leaf': [5, 10]},
             scoring='f1', verbose=1)
In [114]:
##################################
# Identifying the best model
##################################
bagged_rf_optimal = bagged_rf_grid_search.best_estimator_
In [115]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
bagged_rf_optimal_f1_cv = bagged_rf_grid_search.best_score_
bagged_rf_optimal_f1_train = f1_score(y_train_encoded, bagged_rf_optimal.predict(X_train))
bagged_rf_optimal_f1_validation = f1_score(y_validation_encoded, bagged_rf_optimal.predict(X_validation))
In [116]:
##################################
# Identifying the optimal model
##################################
print('Best Bagged Model - Random Forest: ')
print(f"Best Random Forest Hyperparameters: {bagged_rf_grid_search.best_params_}")
Best Bagged Model - Random Forest: 
Best Random Forest Hyperparameters: {'bagged_rf_model__criterion': 'entropy', 'bagged_rf_model__max_depth': 5, 'bagged_rf_model__max_features': 9, 'bagged_rf_model__min_samples_leaf': 5}
In [117]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {bagged_rf_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {bagged_rf_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, bagged_rf_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9121
F1 Score on Training Data: 0.9661

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       200
           1       0.97      0.96      0.97       119

    accuracy                           0.97       319
   macro avg       0.97      0.97      0.97       319
weighted avg       0.97      0.97      0.97       319

In [118]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, bagged_rf_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, bagged_rf_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal random forest model on the train data]
In [119]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {bagged_rf_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, bagged_rf_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9067

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.92      0.99      0.95        67
           1       0.97      0.85      0.91        40

    accuracy                           0.93       107
   macro avg       0.94      0.92      0.93       107
weighted avg       0.94      0.93      0.93       107

In [120]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, bagged_rf_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, bagged_rf_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Random Forest Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal random forest model on the validation data]
In [121]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
bagged_rf_optimal_train = model_performance_evaluation(y_train_encoded, bagged_rf_optimal.predict(X_train))
bagged_rf_optimal_train['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_train['set'] = ['train'] * 5
print('Optimal Random Forest Train Performance Metrics: ')
display(bagged_rf_optimal_train)
Optimal Random Forest Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.974922 bagged_rf_optimal train
1 Precision 0.974359 bagged_rf_optimal train
2 Recall 0.957983 bagged_rf_optimal train
3 F1 0.966102 bagged_rf_optimal train
4 AUROC 0.971492 bagged_rf_optimal train
In [122]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
bagged_rf_optimal_validation = model_performance_evaluation(y_validation_encoded, bagged_rf_optimal.predict(X_validation))
bagged_rf_optimal_validation['model'] = ['bagged_rf_optimal'] * 5
bagged_rf_optimal_validation['set'] = ['validation'] * 5
print('Optimal Random Forest Validation Performance Metrics: ')
display(bagged_rf_optimal_validation)
Optimal Random Forest Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.934579 bagged_rf_optimal validation
1 Precision 0.971429 bagged_rf_optimal validation
2 Recall 0.850000 bagged_rf_optimal validation
3 F1 0.906667 bagged_rf_optimal validation
4 AUROC 0.917537 bagged_rf_optimal validation
In [123]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(bagged_rf_optimal, 
            os.path.join("..", MODELS_PATH, "bagged_model_random_forest_optimal.pkl"))
Out[123]:
['..\\models\\bagged_model_random_forest_optimal.pkl']

1.7.2 AdaBoost¶

AdaBoost (Adaptive Boosting) is a boosting technique that combines multiple weak learners, typically decision stumps (one-level trees) or other shallow trees, to form a strong classifier. It trains the weak models iteratively, assigning higher weights to misclassified instances so that subsequent models focus on the difficult cases. At each iteration, a new weak model is trained and its predictions are added to the ensemble through a weighted vote, where more accurate models receive larger weights. This process continues until a stopping criterion is met, such as a predefined number of iterations or a performance threshold. AdaBoost can improve accuracy with limited overfitting when it is properly regularized, for example through a small learning rate and shallow base learners. It performs well with clean data and can turn weak classifiers into a strong one. However, it is sensitive to noisy data and outliers, because misclassified points receive higher importance, which can lead to overfitting. Additionally, training can be slow for large datasets because the learners are fitted sequentially, and performance depends on the choice of base learner, typically a decision tree.
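The reweighting loop described above can be sketched in plain Python. This is only a minimal illustration of discrete AdaBoost with one-feature threshold stumps on a small made-up dataset; the function names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the data are hypothetical and are not part of the notebook, and the sketch is not the scikit-learn implementation used later.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x is above the threshold, else the opposite label."""
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return the (threshold, polarity, weighted_error) stump with the lowest weighted error."""
    values = sorted(set(xs))
    # Candidate thresholds: one below the minimum, then midpoints between distinct values.
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    """Fit discrete AdaBoost on labels in {-1, +1}; returns (alpha, threshold, polarity) triples."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        threshold, polarity, error = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weights of misclassified points, decrease the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Made-up 1D data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
print(predictions == ys)
```

On this toy data a single stump necessarily misclassifies some points, while the weighted vote of three stumps reproduces the labels, which is the "weak learners into a strong one" behavior the paragraph describes.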

  1. The AdaBoost model (AdaBoostClassifier) from the sklearn.ensemble Python library API was implemented.
  2. The model contains 3 hyperparameters for tuning:
    • estimator__max_depth = maximum depth of the base decision tree, evaluated at 1 and 2
    • learning_rate = weight applied to each classifier at each boosting iteration, evaluated at 0.01 and 0.10
    • n_estimators = maximum number of estimators at which boosting is terminated, evaluated at 50 and 100
  3. No hyperparameter was defined in the model to address the minimal 1.7:1 class imbalance observed between the B and M diagnosis categories.
  4. Hyperparameter tuning was conducted using 5-cycle (repeated) 5-fold cross-validation, with the F1 score identifying the optimal model at:
    • estimator__max_depth = 2
    • learning_rate = 0.10
    • n_estimators = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9937
    • Precision = 1.0000
    • Recall = 0.9832
    • F1 Score = 0.9915
    • AUROC = 0.9916
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9439
    • Precision = 0.9722
    • Recall = 0.8750
    • F1 Score = 0.9211
    • AUROC = 0.9300
  7. The apparent and independent validation model performance were sufficiently comparable, which may indicate the absence of excessive model overfitting.
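The independent validation metrics listed above can be cross-checked from confusion-matrix counts. The sketch below is a hedged illustration: the counts (TN = 66, FP = 1, FN = 5, TP = 35) are inferred from the reported class support (67 and 40), recall and precision, not read from the confusion-matrix figure, and `metrics_from_counts` is a hypothetical helper rather than the notebook's `model_performance_evaluation`. It also assumes that the reported AUROC was computed from hard class predictions, in which case it equals the mean of sensitivity and specificity.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity (true positive rate)
    specificity = tn / (tn + fp)     # true negative rate
    return {
        'Accuracy': (tp + tn) / total,
        'Precision': precision,
        'Recall': recall,
        'F1': 2 * precision * recall / (precision + recall),
        # AUROC of a single hard-label operating point: area under the
        # two-segment ROC curve through (FPR, TPR).
        'AUROC (hard labels)': (recall + specificity) / 2,
    }

# Counts inferred from the reported AdaBoost validation results (assumption).
validation_metrics = metrics_from_counts(tn=66, fp=1, fn=5, tp=35)
for name, value in validation_metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the recomputed values agree with the reported validation metrics to four decimal places. Passing predicted probabilities instead of hard labels to an AUROC routine would generally give a different (threshold-independent) value.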
In [124]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [125]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_ab_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('boosted_ab_model', AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321),
                                            random_state=987654321))
])
In [126]:
##################################
# Defining hyperparameter grid
##################################
boosted_ab_hyperparameter_grid = {
    'boosted_ab_model__learning_rate': [0.01, 0.10],  
    'boosted_ab_model__estimator__max_depth': [1, 2],
    'boosted_ab_model__n_estimators': [50, 100]
}
In [127]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [128]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_ab_grid_search = GridSearchCV(
    estimator=boosted_ab_pipeline,
    param_grid=boosted_ab_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
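The size of this search follows directly from the grid and the cross-validation strategy defined above. The following sketch only enumerates the combinations with the standard library; it does not fit any model, and the variable names are illustrative.

```python
from itertools import product

# Same grid values as boosted_ab_hyperparameter_grid above.
hyperparameter_grid = {
    'boosted_ab_model__learning_rate': [0.01, 0.10],
    'boosted_ab_model__estimator__max_depth': [1, 2],
    'boosted_ab_model__n_estimators': [50, 100],
}
n_splits, n_repeats = 5, 5  # RepeatedStratifiedKFold settings

# Every combination of grid values is one candidate configuration.
candidates = [dict(zip(hyperparameter_grid, values))
              for values in product(*hyperparameter_grid.values())]

folds_per_candidate = n_splits * n_repeats
total_fits = len(candidates) * folds_per_candidate

print(f"{len(candidates)} candidates x {folds_per_candidate} folds = {total_fits} fits")
```

With the default `refit=True`, GridSearchCV additionally refits the best candidate once on the full training data after the cross-validated fits are complete.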
In [129]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [130]:
##################################
# Fitting GridSearchCV
##################################
boosted_ab_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 8 candidates, totalling 200 fits
Out[130]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('boosted_ab_model',
                                        AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=987654321),
                                                           random_state=987654321))]),
             n_jobs=-1,
             param_grid={'boosted_ab_model__estimator__max_depth': [1, 2],
                         'boosted_ab_model__learning_rate': [0.01, 0.1],
                         'boosted_ab_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [131]:
##################################
# Identifying the best model
##################################
boosted_ab_optimal = boosted_ab_grid_search.best_estimator_
In [132]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_ab_optimal_f1_cv = boosted_ab_grid_search.best_score_
boosted_ab_optimal_f1_train = f1_score(y_train_encoded, boosted_ab_optimal.predict(X_train))
boosted_ab_optimal_f1_validation = f1_score(y_validation_encoded, boosted_ab_optimal.predict(X_validation))
In [133]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - AdaBoost: ')
print(f"Best AdaBoost Hyperparameters: {boosted_ab_grid_search.best_params_}")
Best Boosted Model - AdaBoost: 
Best AdaBoost Hyperparameters: {'boosted_ab_model__estimator__max_depth': 2, 'boosted_ab_model__learning_rate': 0.1, 'boosted_ab_model__n_estimators': 100}
In [134]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_ab_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_ab_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, boosted_ab_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9280
F1 Score on Training Data: 0.9915

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       200
           1       1.00      0.98      0.99       119

    accuracy                           0.99       319
   macro avg       1.00      0.99      0.99       319
weighted avg       0.99      0.99      0.99       319

In [135]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, boosted_ab_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, boosted_ab_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices, optimal AdaBoost train performance]
In [136]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_ab_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, boosted_ab_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9211

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.93      0.99      0.96        67
           1       0.97      0.88      0.92        40

    accuracy                           0.94       107
   macro avg       0.95      0.93      0.94       107
weighted avg       0.95      0.94      0.94       107

In [137]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, boosted_ab_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, boosted_ab_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal AdaBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices, optimal AdaBoost validation performance]
In [138]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_ab_optimal_train = model_performance_evaluation(y_train_encoded, boosted_ab_optimal.predict(X_train))
boosted_ab_optimal_train['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_train['set'] = ['train'] * 5
print('Optimal AdaBoost Train Performance Metrics: ')
display(boosted_ab_optimal_train)
Optimal AdaBoost Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.993730 boosted_ab_optimal train
1 Precision 1.000000 boosted_ab_optimal train
2 Recall 0.983193 boosted_ab_optimal train
3 F1 0.991525 boosted_ab_optimal train
4 AUROC 0.991597 boosted_ab_optimal train
In [139]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_ab_optimal_validation = model_performance_evaluation(y_validation_encoded, boosted_ab_optimal.predict(X_validation))
boosted_ab_optimal_validation['model'] = ['boosted_ab_optimal'] * 5
boosted_ab_optimal_validation['set'] = ['validation'] * 5
print('Optimal AdaBoost Validation Performance Metrics: ')
display(boosted_ab_optimal_validation)
Optimal AdaBoost Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.943925 boosted_ab_optimal validation
1 Precision 0.972222 boosted_ab_optimal validation
2 Recall 0.875000 boosted_ab_optimal validation
3 F1 0.921053 boosted_ab_optimal validation
4 AUROC 0.930037 boosted_ab_optimal validation
In [140]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_ab_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_adaboost_optimal.pkl"))
Out[140]:
['..\\models\\boosted_model_adaboost_optimal.pkl']

1.7.3 Gradient Boosting¶

Gradient Boosting builds an ensemble of decision trees sequentially, where each new tree corrects the mistakes of the previous ones by optimizing a loss function. Unlike AdaBoost, which reweights misclassified instances, Gradient Boosting fits each new tree to the residual errors (more generally, the negative gradient of the loss) of the current ensemble, gradually improving predictions. This process continues until a stopping criterion is met, such as a set number of trees or a lack of improvement on a held-out fraction of the training data. The key advantages of Gradient Boosting include its flexibility to model complex relationships and strong predictive performance, often outperforming bagging methods. It can handle both numeric and categorical data, although the scikit-learn implementation used here expects categorical features to be numerically encoded first. However, it is prone to overfitting if not carefully tuned, especially with deep trees and too many iterations. It is also computationally expensive due to sequential training, and hyperparameter tuning (e.g., learning rate, number of trees, tree depth) can be challenging and time-consuming.
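The residual-fitting idea can be sketched in plain Python. This is a simplified illustration that uses squared-error loss (for which the negative gradient is exactly the residual) and one-split regression stumps on a made-up 1D dataset; the notebook itself uses scikit-learn's classifier with log loss, and the function names here are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares single-split regressor: returns (threshold, left_value, right_value)."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost_fit(xs, ys, n_estimators=50, learning_rate=0.1):
    """Boost regression stumps on the residuals of the running prediction."""
    base = sum(ys) / len(ys)              # initial constant prediction
    predictions = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Shrink each tree's contribution by the learning rate.
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, history

def gradient_boost_predict(base, stumps, x, learning_rate=0.1):
    """Sum the base value and the shrunken contribution of every stump."""
    return base + sum(learning_rate * (left if x <= threshold else right)
                      for threshold, left, right in stumps)

# Made-up data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, stumps, history = gradient_boost_fit(xs, ys)
print(f"training MSE after first tree: {history[0]:.4f}, after last tree: {history[-1]:.4f}")
```

The training error never increases from one stage to the next in this setting, which also shows why the method can overfit when many stages are added without a held-out check.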

  1. The gradient boosting model (GradientBoostingClassifier) from the sklearn.ensemble Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = shrinking proportion of the contribution from each tree, evaluated at 0.01 and 0.10
    • max_depth = maximum depth of the tree, evaluated at 3 and 6
    • min_samples_leaf = minimum number of samples required to be at a leaf node, evaluated at 5 and 10
    • n_estimators = number of boosting stages to perform, evaluated at 50 and 100 (an upper bound here, since early stopping with n_iter_no_change = 10 was enabled)
  3. No hyperparameter was defined in the model to address the minimal 1.7:1 class imbalance observed between the B and M diagnosis categories.
  4. Hyperparameter tuning was conducted using 5-cycle (repeated) 5-fold cross-validation, with the F1 score identifying the optimal model at:
    • learning_rate = 0.10
    • max_depth = 3
    • min_samples_leaf = 10
    • n_estimators = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 1.0000
    • Precision = 1.0000
    • Recall = 1.0000
    • F1 Score = 1.0000
    • AUROC = 1.0000
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9346
    • Precision = 0.9714
    • Recall = 0.8500
    • F1 Score = 0.9067
    • AUROC = 0.9175
  7. The apparent and independent validation model performance were sufficiently comparable, which may indicate the absence of excessive model overfitting.
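One way to make the comparison between apparent and independent performance concrete is to compute the gap in F1 between the training data and the cross-validated or validation data. The sketch below uses the F1 scores printed in this notebook for AdaBoost and Gradient Boosting; the dictionary layout and the gap names are illustrative, and the notebook does not define a numeric threshold for "excessive" overfitting, so none is assumed here.

```python
# F1 scores as printed by the notebook cells (rounded to 4 decimals).
f1_scores = {
    'AdaBoost': {'train': 0.9915, 'cv': 0.9280, 'validation': 0.9211},
    'Gradient Boosting': {'train': 1.0000, 'cv': 0.9330, 'validation': 0.9067},
}

# Optimism: how much the apparent (train) score exceeds the held-out estimates.
optimism = {
    model: {
        'train_minus_cv': scores['train'] - scores['cv'],
        'train_minus_validation': scores['train'] - scores['validation'],
    }
    for model, scores in f1_scores.items()
}

for model, gaps in optimism.items():
    print(f"{model}: train-CV gap = {gaps['train_minus_cv']:.4f}, "
          f"train-validation gap = {gaps['train_minus_validation']:.4f}")
```

A smaller gap suggests that the apparent performance is a less optimistic estimate of out-of-sample performance; comparing the gaps across models gives a relative, not absolute, indication of overfitting.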
In [141]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [142]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_gb_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('boosted_gb_model', GradientBoostingClassifier(n_iter_no_change=10,
                                                    validation_fraction=0.1,
                                                    tol=1e-4,
                                                    random_state=987654321))
])
In [143]:
##################################
# Defining hyperparameter grid
##################################
boosted_gb_hyperparameter_grid = {
    'boosted_gb_model__learning_rate': [0.01, 0.10],
    'boosted_gb_model__max_depth': [3, 6], 
    'boosted_gb_model__min_samples_leaf': [5, 10],
    'boosted_gb_model__n_estimators': [50, 100] 
}
In [144]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [145]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_gb_grid_search = GridSearchCV(
    estimator=boosted_gb_pipeline,
    param_grid=boosted_gb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [146]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [147]:
##################################
# Fitting GridSearchCV
##################################
boosted_gb_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[147]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('boosted_gb_model',
                                        GradientBoostingClassifier(n_iter_no_change=10,
                                                                   random_state=987654321))]),
             n_jobs=-1,
             param_grid={'boosted_gb_model__learning_rate': [0.01, 0.1],
                         'boosted_gb_model__max_depth': [3, 6],
                         'boosted_gb_model__min_samples_leaf': [5, 10],
                         'boosted_gb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [148]:
##################################
# Identifying the best model
##################################
boosted_gb_optimal = boosted_gb_grid_search.best_estimator_
In [149]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_gb_optimal_f1_cv = boosted_gb_grid_search.best_score_
boosted_gb_optimal_f1_train = f1_score(y_train_encoded, boosted_gb_optimal.predict(X_train))
boosted_gb_optimal_f1_validation = f1_score(y_validation_encoded, boosted_gb_optimal.predict(X_validation))
In [150]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Gradient Boosting: ')
print(f"Best Gradient Boosting Hyperparameters: {boosted_gb_grid_search.best_params_}")
Best Boosted Model - Gradient Boosting: 
Best Gradient Boosting Hyperparameters: {'boosted_gb_model__learning_rate': 0.1, 'boosted_gb_model__max_depth': 3, 'boosted_gb_model__min_samples_leaf': 10, 'boosted_gb_model__n_estimators': 100}
In [151]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_gb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_gb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, boosted_gb_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9330
F1 Score on Training Data: 1.0000

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       119

    accuracy                           1.00       319
   macro avg       1.00      1.00      1.00       319
weighted avg       1.00      1.00      1.00       319

In [152]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, boosted_gb_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, boosted_gb_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices, optimal Gradient Boosting train performance]
In [153]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_gb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, boosted_gb_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9067

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.92      0.99      0.95        67
           1       0.97      0.85      0.91        40

    accuracy                           0.93       107
   macro avg       0.94      0.92      0.93       107
weighted avg       0.94      0.93      0.93       107

In [154]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, boosted_gb_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, boosted_gb_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Gradient Boosting Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices, optimal Gradient Boosting validation performance]
In [155]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_gb_optimal_train = model_performance_evaluation(y_train_encoded, boosted_gb_optimal.predict(X_train))
boosted_gb_optimal_train['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_train['set'] = ['train'] * 5
print('Optimal Gradient Boosting Train Performance Metrics: ')
display(boosted_gb_optimal_train)
Optimal Gradient Boosting Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 1.0 boosted_gb_optimal train
1 Precision 1.0 boosted_gb_optimal train
2 Recall 1.0 boosted_gb_optimal train
3 F1 1.0 boosted_gb_optimal train
4 AUROC 1.0 boosted_gb_optimal train
In [156]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_gb_optimal_validation = model_performance_evaluation(y_validation_encoded, boosted_gb_optimal.predict(X_validation))
boosted_gb_optimal_validation['model'] = ['boosted_gb_optimal'] * 5
boosted_gb_optimal_validation['set'] = ['validation'] * 5
print('Optimal Gradient Boosting Validation Performance Metrics: ')
display(boosted_gb_optimal_validation)
Optimal Gradient Boosting Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.934579 boosted_gb_optimal validation
1 Precision 0.971429 boosted_gb_optimal validation
2 Recall 0.850000 boosted_gb_optimal validation
3 F1 0.906667 boosted_gb_optimal validation
4 AUROC 0.917537 boosted_gb_optimal validation
In [157]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_gb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_gradient_boosting_optimal.pkl"))
Out[157]:
['..\\models\\boosted_model_gradient_boosting_optimal.pkl']

1.7.4 XGBoost¶

XGBoost (Extreme Gradient Boosting) is an optimized version of Gradient Boosting that introduces additional regularization and computational efficiencies. It builds decision trees sequentially, with each new tree correcting the residual errors of the previous ones, but it incorporates advanced techniques such as shrinkage (learning rate), column subsampling, and L1/L2 regularization to prevent overfitting. Additionally, XGBoost employs parallelization, reducing training time significantly compared to standard Gradient Boosting. It is widely used in machine learning competitions due to its superior accuracy and efficiency. The key advantages include its ability to handle missing data, built-in regularization for better generalization, and fast training through parallelization. However, XGBoost requires careful hyperparameter tuning to achieve optimal performance, and the model can become overly complex, making interpretation difficult. It is also memory-intensive, especially for large datasets, and can be challenging to deploy efficiently in real-time applications.
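The effect of L2 regularization and of the minimum loss reduction (gamma) on tree growth can be illustrated with the split-gain expression from the XGBoost formulation, where a split is scored from the sums of first-order gradients (G) and second-order gradients (H) in each child. The sketch below is a plain-Python illustration with made-up gradient sums; `leaf_score` and `split_gain` are hypothetical helper names, not functions of the xgboost library.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized loss reduction of a split, net of the gamma penalty.

    gain = 0.5 * [score(left) + score(right) - score(parent)] - gamma
    A split is only worth keeping when this value is positive.
    """
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Made-up gradient statistics for two candidate splits.
strong_split = split_gain(-4.0, 5.0, 3.0, 4.0, reg_lambda=1.0, gamma=0.2)
weak_split_no_gamma = split_gain(-0.5, 5.0, 0.4, 4.0, reg_lambda=1.0, gamma=0.0)
weak_split_with_gamma = split_gain(-0.5, 5.0, 0.4, 4.0, reg_lambda=1.0, gamma=0.1)

print(f"strong split, gamma=0.2: {strong_split:.4f}")
print(f"weak split, gamma=0.0:   {weak_split_no_gamma:.4f}")
print(f"weak split, gamma=0.1:   {weak_split_with_gamma:.4f}")
```

The weak split has a small positive gain without the penalty but a negative gain once gamma is applied, so it would not be made; larger gamma values therefore yield more conservative trees.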

  1. The XGBoost model (XGBClassifier) from the xgboost Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training, evaluated at 0.01 and 0.10
    • max_depth = maximum depth of the tree, evaluated at 3 and 6
    • gamma = minimum loss reduction required to make a further split in a tree, evaluated at 0.10 and 0.20
    • n_estimators = number of boosting stages to perform, evaluated at 50 and 100
  3. A special hyperparameter (scale_pos_weight = 1.7) was fixed to address the minimal 1.7:1 class imbalance observed between the B and M diagnosis categories.
  4. Hyperparameter tuning was conducted using 5-cycle (repeated) 5-fold cross-validation, with the F1 score identifying the optimal model at:
    • learning_rate = 0.10
    • max_depth = 6
    • gamma = 0.20
    • n_estimators = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 1.0000
    • Precision = 1.0000
    • Recall = 1.0000
    • F1 Score = 1.0000
    • AUROC = 1.0000
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9439
    • Precision = 0.9722
    • Recall = 0.8750
    • F1 Score = 0.9210
    • AUROC = 0.9300
  7. The apparent and independent validation model performance were sufficiently comparable, which may indicate the absence of excessive model overfitting.
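The fixed scale_pos_weight value of 1.7 is consistent with the commonly suggested heuristic of dividing the number of negative instances by the number of positive instances. The short sketch below applies that heuristic to the training class counts shown in the classification reports of this notebook (200 benign, 119 malignant); the variable names are illustrative.

```python
# Training class counts from the classification reports (support column).
n_negative = 200  # B (benign), encoded as 0
n_positive = 119  # M (malignant), encoded as 1

# Heuristic: weight positives by the negative-to-positive ratio.
imbalance_ratio = n_negative / n_positive
scale_pos_weight = round(imbalance_ratio, 1)

print(f"imbalance ratio = {imbalance_ratio:.4f}, scale_pos_weight = {scale_pos_weight}")
```

Computing the ratio from the training split only, as done here, avoids leaking class proportions from the validation or test data into the model configuration.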
In [158]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [159]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_xgb_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('boosted_xgb_model', XGBClassifier(scale_pos_weight=1.7, 
                                        random_state=987654321,
                                        subsample=0.7,
                                        colsample_bytree=0.7,
                                        eval_metric='logloss'))
])
In [160]:
##################################
# Defining hyperparameter grid
##################################
boosted_xgb_hyperparameter_grid = {
    'boosted_xgb_model__learning_rate': [0.01, 0.10],
    'boosted_xgb_model__max_depth': [3, 6], 
    'boosted_xgb_model__gamma': [0.1, 0.2],
    'boosted_xgb_model__n_estimators': [50, 100]
}
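The size of the search implied by this grid can be checked with a few lines of standard-library Python. This is only a counting sketch: it enumerates the same grid values with the pipeline prefixes dropped and does not train any model.

```python
from itertools import product

# Same grid values as the hyperparameter grid cell above (pipeline prefixes dropped).
grid = {
    "learning_rate": [0.01, 0.10],
    "max_depth": [3, 6],
    "gamma": [0.1, 0.2],
    "n_estimators": [50, 100],
}
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_splits, n_repeats = 5, 5  # 5-cycle 5-fold cross-validation
fits_per_candidate = n_splits * n_repeats
total_fits = len(candidates) * fits_per_candidate
print(len(candidates), fits_per_candidate, total_fits)
```

With refit=True (the GridSearchCV default), one additional fit of the best candidate on the full training data follows the cross-validated fits.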
In [161]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
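To make the 5-cycle 5-fold scheme concrete, the following is a simplified standard-library sketch of repeated stratified splitting. It is not scikit-learn's exact algorithm; the function name `repeated_stratified_folds` is hypothetical, and the label vector only mimics the 200:119 class counts of the training data.

```python
import random


def repeated_stratified_folds(labels, n_splits, n_repeats, seed):
    """Yield (repeat, fold, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)  # a fresh shuffle for every repeat
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)  # deal out round-robin
        for fold, test_indices in enumerate(folds):
            yield repeat, fold, sorted(test_indices)


# Synthetic labels that only reproduce the training class counts.
labels = [0] * 200 + [1] * 119
splits = list(repeated_stratified_folds(labels, n_splits=5, n_repeats=5, seed=987654321))
positives_per_fold = [sum(labels[i] for i in test) for _, _, test in splits]
print(len(splits), min(positives_per_fold), max(positives_per_fold))
```

Each repeat reshuffles the data before splitting, so the 25 held-out folds differ between repeats while every fold keeps roughly the same share of malignant cases.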
In [162]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_xgb_grid_search = GridSearchCV(
    estimator=boosted_xgb_pipeline,
    param_grid=boosted_xgb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [163]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [164]:
##################################
# Fitting GridSearchCV
##################################
boosted_xgb_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[164]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('boosted_xgb_model',
                                        XGBClassifier(base_score=None,
                                                      booster=None,
                                                      c...
                                                      missing=nan,
                                                      monotone_constraints=None,
                                                      multi_strategy=None,
                                                      n_estimators=None,
                                                      n_jobs=None,
                                                      num_parallel_tree=None,
                                                      random_state=987654321, ...))]),
             n_jobs=-1,
             param_grid={'boosted_xgb_model__gamma': [0.1, 0.2],
                         'boosted_xgb_model__learning_rate': [0.01, 0.1],
                         'boosted_xgb_model__max_depth': [3, 6],
                         'boosted_xgb_model__n_estimators': [50, 100]},
             scoring='f1', verbose=1)
In [165]:
##################################
# Identifying the best model
##################################
boosted_xgb_optimal = boosted_xgb_grid_search.best_estimator_
In [166]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_xgb_optimal_f1_cv = boosted_xgb_grid_search.best_score_
boosted_xgb_optimal_f1_train = f1_score(y_train_encoded, boosted_xgb_optimal.predict(X_train))
boosted_xgb_optimal_f1_validation = f1_score(y_validation_encoded, boosted_xgb_optimal.predict(X_validation))
In [167]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - XGBoost: ')
print(f"Best XGBoost Hyperparameters: {boosted_xgb_grid_search.best_params_}")
Best Boosted Model - XGBoost: 
Best XGBoost Hyperparameters: {'boosted_xgb_model__gamma': 0.2, 'boosted_xgb_model__learning_rate': 0.1, 'boosted_xgb_model__max_depth': 6, 'boosted_xgb_model__n_estimators': 100}
In [168]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_xgb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_xgb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, boosted_xgb_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9461
F1 Score on Training Data: 1.0000

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       119

    accuracy                           1.00       319
   macro avg       1.00      1.00      1.00       319
weighted avg       1.00      1.00      1.00       319

In [169]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, boosted_xgb_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, boosted_xgb_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal XGBoost model on the train data]
In [170]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_xgb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, boosted_xgb_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9211

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.93      0.99      0.96        67
           1       0.97      0.88      0.92        40

    accuracy                           0.94       107
   macro avg       0.95      0.93      0.94       107
weighted avg       0.95      0.94      0.94       107

In [171]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, boosted_xgb_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, boosted_xgb_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal XGBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal XGBoost model on the validation data]
In [172]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_xgb_optimal_train = model_performance_evaluation(y_train_encoded, boosted_xgb_optimal.predict(X_train))
boosted_xgb_optimal_train['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_train['set'] = ['train'] * 5
print('Optimal XGBoost Train Performance Metrics: ')
display(boosted_xgb_optimal_train)
Optimal XGBoost Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 1.0 boosted_xgb_optimal train
1 Precision 1.0 boosted_xgb_optimal train
2 Recall 1.0 boosted_xgb_optimal train
3 F1 1.0 boosted_xgb_optimal train
4 AUROC 1.0 boosted_xgb_optimal train
In [173]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_xgb_optimal_validation = model_performance_evaluation(y_validation_encoded, boosted_xgb_optimal.predict(X_validation))
boosted_xgb_optimal_validation['model'] = ['boosted_xgb_optimal'] * 5
boosted_xgb_optimal_validation['set'] = ['validation'] * 5
print('Optimal XGBoost Validation Performance Metrics: ')
display(boosted_xgb_optimal_validation)
Optimal XGBoost Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.943925 boosted_xgb_optimal validation
1 Precision 0.972222 boosted_xgb_optimal validation
2 Recall 0.875000 boosted_xgb_optimal validation
3 F1 0.921053 boosted_xgb_optimal validation
4 AUROC 0.930037 boosted_xgb_optimal validation
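The validation metrics above can be reproduced from a single set of confusion counts. The counts in this sketch are an assumption: they are inferred from the reported precision, recall and class supports (67 benign, 40 malignant), since the confusion-matrix figure is not reproduced as text.

```python
# Confusion counts inferred from the reported validation metrics (an assumption:
# the confusion-matrix figure itself is not available as text).
tn, fp, fn, tp = 66, 1, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
# With hard 0/1 predictions the ROC curve has a single operating point, so the
# area under it reduces to the mean of recall and specificity.
auroc_from_labels = (recall + specificity) / 2

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1), ("AUROC", auroc_from_labels)]:
    print(f"{name}: {value:.6f}")
```

The agreement with the reported AUROC is consistent with that value being computed from the class labels returned by predict() rather than from predicted probabilities. An AUROC based on predict_proba() scores would generally differ.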
In [174]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_xgb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_xgboost_optimal.pkl"))
Out[174]:
['..\\models\\boosted_model_xgboost_optimal.pkl']

1.7.5 Light GBM

Light GBM (Light Gradient Boosting Machine) is a variation of Gradient Boosting designed for efficiency and scalability. Unlike traditional boosting methods that grow trees level by level, LightGBM grows trees leaf-wise, choosing the most informative splits, leading to faster convergence. It also uses histogram-based binning to speed up computations. These optimizations allow LightGBM to train on large datasets efficiently while maintaining high accuracy. Its advantages include faster training speed, reduced memory usage, and strong predictive performance, particularly for large datasets with many features. However, LightGBM can overfit more easily than XGBoost if not properly tuned, and it may not perform as well on small datasets. Additionally, its handling of categorical variables requires careful preprocessing, and the leaf-wise tree growth can sometimes lead to instability if not controlled properly.
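Leaf-wise (best-first) growth can be sketched with a priority queue. The toy below is not LightGBM's implementation: it grows a single regression tree on made-up one-dimensional data, always splitting the leaf with the largest squared-error reduction until a num_leaves budget is reached. The helper names `sse`, `best_split` and `grow_leaf_wise` are hypothetical.

```python
import heapq
from itertools import count


def sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(points):
    """Return (gain, left_points, right_points) for the best cut, or None."""
    points = sorted(points)
    ys = [y for _, y in points]
    best = None
    for cut in range(1, len(points)):
        if points[cut - 1][0] == points[cut][0]:
            continue  # cannot split between identical feature values
        gain = sse(ys) - sse(ys[:cut]) - sse(ys[cut:])
        if best is None or gain > best[0]:
            best = (gain, points[:cut], points[cut:])
    return best


def grow_leaf_wise(points, num_leaves):
    """Best-first growth: always split the leaf with the largest gain."""
    final_leaves = []      # leaves that cannot be improved by a split
    heap = []              # splittable leaves as (-gain, tie_breaker, left, right)
    tie_breaker = count()

    def push(leaf_points):
        split = best_split(leaf_points)
        if split is None or split[0] <= 0:
            final_leaves.append(leaf_points)
        else:
            heapq.heappush(heap, (-split[0], next(tie_breaker), split[1], split[2]))

    push(points)
    # Every pop replaces one leaf with two, so the leaf count grows by one.
    while heap and len(final_leaves) + len(heap) < num_leaves:
        _, _, left, right = heapq.heappop(heap)
        push(left)
        push(right)
    # Leaves still on the heap were never split; put their halves back together.
    return final_leaves + [left + right for _, _, left, right in heap]


# Made-up data: a low, nearly flat region followed by a highly variable one.
points = [(float(i), 0.0) for i in range(8)] + [
    (8.0, 1.0), (9.0, 1.0), (10.0, 5.0), (11.0, 5.0),
    (12.0, 9.0), (13.0, 9.0), (14.0, 2.0), (15.0, 2.0),
]
leaves = grow_leaf_wise(points, num_leaves=4)
print(sorted(len(leaf) for leaf in leaves))
```

On this data the leaf budget is spent on the variable region while the nearly flat region stays in one large leaf. A level-wise tree of depth 2 would instead split both children of the root. The same behavior is why num_leaves, rather than depth alone, is the main complexity control for leaf-wise trees.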

  1. The LGBMClassifier model from the lightgbm Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training, evaluated at 0.01 and 0.10
    • min_child_samples = minimum number of data points needed in a child (leaf), evaluated at 3 and 6
    • num_leaves = maximum tree leaves for base learners, evaluated at 8 and 16
    • n_estimators = number of boosted trees to fit, evaluated at 50 and 100
  3. A special hyperparameter (scale_pos_weight = 1.7) was fixed to address the mild 1.7:1 class imbalance observed between the B and M diagnosis categories.
  4. Hyperparameter tuning was conducted using 5-cycle 5-fold cross-validation, with the optimal model performance based on the F1 score determined for:
    • learning_rate = 0.10
    • min_child_samples = 6
    • num_leaves = 16
    • n_estimators = 50
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 1.0000
    • Precision = 1.0000
    • Recall = 1.0000
    • F1 Score = 1.0000
    • AUROC = 1.0000
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9533
    • Precision = 0.9730
    • Recall = 0.9000
    • F1 Score = 0.9351
    • AUROC = 0.9425
  7. The apparent and independent validation model performance were sufficiently comparable, which might indicate the absence of excessive model overfitting.
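The comparison between apparent and validated performance can be expressed as a simple gap calculation. The sketch below copies the F1 scores printed in the XGBoost and Light GBM outputs of this section. The dictionary layout and key names are illustrative choices, and no threshold for an acceptable gap is implied.

```python
# F1 scores copied from the XGBoost and Light GBM outputs in this section.
f1_scores = {
    "XGBoost": {"train": 1.0000, "cv": 0.9461, "validation": 0.9211},
    "Light GBM": {"train": 1.0000, "cv": 0.9385, "validation": 0.9351},
}

gaps = {}
for model, scores in f1_scores.items():
    gaps[model] = {
        "train_minus_cv": scores["train"] - scores["cv"],
        "train_minus_validation": scores["train"] - scores["validation"],
    }
    print(f"{model}: train-CV gap = {gaps[model]['train_minus_cv']:.4f}, "
          f"train-validation gap = {gaps[model]['train_minus_validation']:.4f}")
```

The cross-validated F1 is usually the more informative reference here, because a perfect training score is common for boosted trees and says little on its own.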
In [175]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [176]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_lgbm_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('boosted_lgbm_model', LGBMClassifier(scale_pos_weight=1.7, 
                                          random_state=987654321,
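                                          max_depth=-1,
                                          feature_fraction=0.7,
                                          # Note: per the LightGBM parameter docs, bagging_fraction
                                          # only takes effect when bagging_freq > 0 (left at 0 here)
                                          bagging_fraction=0.7,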
                                          verbose=-1))
])
In [177]:
##################################
# Defining hyperparameter grid
##################################
boosted_lgbm_hyperparameter_grid = {
    'boosted_lgbm_model__learning_rate': [0.01, 0.10],
    'boosted_lgbm_model__min_child_samples': [3, 6], 
    'boosted_lgbm_model__num_leaves': [8, 16],
    'boosted_lgbm_model__n_estimators': [50, 100] 
}
In [178]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [179]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_lgbm_grid_search = GridSearchCV(
    estimator=boosted_lgbm_pipeline,
    param_grid=boosted_lgbm_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [180]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [181]:
##################################
# Fitting GridSearchCV
##################################
boosted_lgbm_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[181]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('boosted_lgbm_model',
                                        LGBMClassifier(bagging_fraction=0.7,
                                                       feature_fraction=0.7,
                                                       random_state=987654321,
                                                       scale_pos_weight=1.7,
                                                       verbose=-1))]),
             n_jobs=-1,
             param_grid={'boosted_lgbm_model__learning_rate': [0.01, 0.1],
                         'boosted_lgbm_model__min_child_samples': [3, 6],
                         'boosted_lgbm_model__n_estimators': [50, 100],
                         'boosted_lgbm_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In [182]:
##################################
# Identifying the best model
##################################
boosted_lgbm_optimal = boosted_lgbm_grid_search.best_estimator_
In [183]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_lgbm_optimal_f1_cv = boosted_lgbm_grid_search.best_score_
boosted_lgbm_optimal_f1_train = f1_score(y_train_encoded, boosted_lgbm_optimal.predict(X_train))
boosted_lgbm_optimal_f1_validation = f1_score(y_validation_encoded, boosted_lgbm_optimal.predict(X_validation))
In [184]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - Light GBM: ')
print(f"Best Light GBM Hyperparameters: {boosted_lgbm_grid_search.best_params_}")
Best Boosted Model - Light GBM: 
Best Light GBM Hyperparameters: {'boosted_lgbm_model__learning_rate': 0.1, 'boosted_lgbm_model__min_child_samples': 6, 'boosted_lgbm_model__n_estimators': 50, 'boosted_lgbm_model__num_leaves': 16}
In [185]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_lgbm_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_lgbm_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, boosted_lgbm_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9385
F1 Score on Training Data: 1.0000

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       200
           1       1.00      1.00      1.00       119

    accuracy                           1.00       319
   macro avg       1.00      1.00      1.00       319
weighted avg       1.00      1.00      1.00       319

In [186]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, boosted_lgbm_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, boosted_lgbm_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Light GBM model on the train data]
In [187]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_lgbm_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, boosted_lgbm_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9351

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.94      0.99      0.96        67
           1       0.97      0.90      0.94        40

    accuracy                           0.95       107
   macro avg       0.96      0.94      0.95       107
weighted avg       0.95      0.95      0.95       107

In [188]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, boosted_lgbm_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, boosted_lgbm_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal Light GBM Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices for the optimal Light GBM model on the validation data]
In [189]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_lgbm_optimal_train = model_performance_evaluation(y_train_encoded, boosted_lgbm_optimal.predict(X_train))
boosted_lgbm_optimal_train['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_train['set'] = ['train'] * 5
print('Optimal Light GBM Train Performance Metrics: ')
display(boosted_lgbm_optimal_train)
Optimal Light GBM Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 1.0 boosted_lgbm_optimal train
1 Precision 1.0 boosted_lgbm_optimal train
2 Recall 1.0 boosted_lgbm_optimal train
3 F1 1.0 boosted_lgbm_optimal train
4 AUROC 1.0 boosted_lgbm_optimal train
In [190]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_lgbm_optimal_validation = model_performance_evaluation(y_validation_encoded, boosted_lgbm_optimal.predict(X_validation))
boosted_lgbm_optimal_validation['model'] = ['boosted_lgbm_optimal'] * 5
boosted_lgbm_optimal_validation['set'] = ['validation'] * 5
print('Optimal Light GBM Validation Performance Metrics: ')
display(boosted_lgbm_optimal_validation)
Optimal Light GBM Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.953271 boosted_lgbm_optimal validation
1 Precision 0.972973 boosted_lgbm_optimal validation
2 Recall 0.900000 boosted_lgbm_optimal validation
3 F1 0.935065 boosted_lgbm_optimal validation
4 AUROC 0.942537 boosted_lgbm_optimal validation
In [191]:
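##################################
# Saving the best individual model
# developed from the train data
##################################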
joblib.dump(boosted_lgbm_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_light_gbm_optimal.pkl"))
Out[191]:
['..\\models\\boosted_model_light_gbm_optimal.pkl']

1.7.6 CatBoost

CatBoost (Categorical Boosting) is a boosting algorithm optimized for categorical data. Unlike other gradient boosting methods that require categorical variables to be manually encoded, CatBoost handles them natively, reducing preprocessing effort and improving performance. It builds decision trees iteratively, like other boosting methods, but uses ordered boosting to prevent target leakage and enhance generalization. The main advantages of CatBoost are its ability to handle categorical data without extensive preprocessing, high accuracy with minimal tuning, and robustness against overfitting due to built-in regularization. Additionally, it is relatively fast and memory-efficient. However, CatBoost can still be slower than LightGBM on very large datasets, and while it requires less tuning, improper parameter selection can lead to suboptimal performance. Its internal mechanics, such as ordered boosting, make interpretation more complex compared to simpler models.
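The leakage-avoidance idea behind CatBoost's categorical handling can be illustrated with ordered target statistics, a technique closely related to ordered boosting. The sketch below is a simplification: the real library averages over random permutations, and the function name `ordered_target_statistics`, the prior of 0.5 and the toy data are all assumptions made for illustration.

```python
def ordered_target_statistics(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of rows that precede it."""
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of earlier targets; the row's own target is excluded.
        encoded.append((seen_sum + prior * prior_weight) / (seen_count + prior_weight))
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


# Toy data, used only to show the encoding order.
categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_statistics(categories, targets)
print(encoded)
```

Because each encoded value depends only on earlier rows, a row's own label never leaks into its feature. The pipeline in this section passes numeric PCA components to the classifier, so this mechanism is not exercised here.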

  1. The CatBoostClassifier model from the catboost Python library API was implemented.
  2. The model contains 4 hyperparameters for tuning:
    • learning_rate = step size at which weights are updated during training, evaluated at 0.01 and 0.10
    • max_depth = maximum depth of each decision tree in the boosting process, evaluated at 3 and 6
    • num_leaves = maximum tree leaves for base learners, evaluated at 8 and 16
    • iterations = number of boosted trees to fit, evaluated at 50 and 100
  3. A special hyperparameter (scale_pos_weight) was fixed to address the mild 1.7:1 class imbalance observed between the B and M diagnosis categories; the pipeline code sets it to 2.0 rather than the 1.7 used for the other boosted models.
  4. Hyperparameter tuning was conducted using 5-cycle 5-fold cross-validation, with the optimal model performance based on the F1 score determined for:
    • learning_rate = 0.10
    • max_depth = 6
    • num_leaves = 8
    • iterations = 100
  5. The apparent model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9968
    • Precision = 0.9916
    • Recall = 1.0000
    • F1 Score = 0.9958
    • AUROC = 0.9975
  6. The independent validation model performance of the optimal model is summarized as follows:
    • Accuracy = 0.9626
    • Precision = 0.9736
    • Recall = 0.9250
    • F1 Score = 0.9487
    • AUROC = 0.9550
  7. The apparent and independent validation model performance were sufficiently comparable, which might indicate the absence of excessive model overfitting.
In [192]:
##################################
# Defining the missing value imputation, scaling and PCA preprocessing parameters
##################################
scaling_pca_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),           
    ('pca', PCA(n_components=10, random_state=987654321))  
])
In [193]:
##################################
# Defining the preprocessing and modeling pipeline parameters
##################################
boosted_cb_pipeline = Pipeline([
    ('scaling_pca_preprocessor', scaling_pca_preprocessor),
    ('boosted_cb_model', CatBoostClassifier(scale_pos_weight=2.0, 
                                            random_state=987654321,
                                            subsample=0.7,
                                            colsample_bylevel=0.7,
                                            grow_policy='Lossguide',
                                            verbose=0,
                                            allow_writing_files=False))
])
In [194]:
##################################
# Defining hyperparameter grid
##################################
boosted_cb_hyperparameter_grid = {
    'boosted_cb_model__learning_rate': [0.01, 0.10],
    'boosted_cb_model__max_depth': [3, 6], 
    'boosted_cb_model__num_leaves': [8, 16],
    'boosted_cb_model__iterations': [50, 100]
}
In [195]:
##################################
# Defining the cross-validation strategy (5-cycle 5-fold CV)
##################################
cv_strategy = RepeatedStratifiedKFold(n_splits=5, 
                                      n_repeats=5, 
                                      random_state=987654321)
In [196]:
##################################
# Performing Grid Search with cross-validation
##################################
boosted_cb_grid_search = GridSearchCV(
    estimator=boosted_cb_pipeline,
    param_grid=boosted_cb_hyperparameter_grid,
    scoring='f1',
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1
)
In [197]:
##################################
# Encoding the response variables
# for model training and validation
##################################
y_train_encoded = y_train.map({'B': 0, 'M': 1})
y_validation_encoded = y_validation.map({'B': 0, 'M': 1})
In [198]:
##################################
# Fitting GridSearchCV
##################################
boosted_cb_grid_search.fit(X_train, y_train_encoded)
Fitting 25 folds for each of 16 candidates, totalling 400 fits
Out[198]:
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=987654321),
             estimator=Pipeline(steps=[('scaling_pca_preprocessor',
                                        Pipeline(steps=[('imputer',
                                                         SimpleImputer(strategy='median')),
                                                        ('scaler',
                                                         StandardScaler()),
                                                        ('pca',
                                                         PCA(n_components=10,
                                                             random_state=987654321))])),
                                       ('boosted_cb_model',
                                        <catboost.core.CatBoostClassifier object at 0x00000296BF6D7800>)]),
             n_jobs=-1,
             param_grid={'boosted_cb_model__iterations': [50, 100],
                         'boosted_cb_model__learning_rate': [0.01, 0.1],
                         'boosted_cb_model__max_depth': [3, 6],
                         'boosted_cb_model__num_leaves': [8, 16]},
             scoring='f1', verbose=1)
In [199]:
##################################
# Identifying the best model
##################################
boosted_cb_optimal = boosted_cb_grid_search.best_estimator_
In [200]:
##################################
# Evaluating the F1 scores
# on the training, cross-validation, and validation data
##################################
boosted_cb_optimal_f1_cv = boosted_cb_grid_search.best_score_
boosted_cb_optimal_f1_train = f1_score(y_train_encoded, boosted_cb_optimal.predict(X_train))
boosted_cb_optimal_f1_validation = f1_score(y_validation_encoded, boosted_cb_optimal.predict(X_validation))
In [201]:
##################################
# Identifying the optimal model
##################################
print('Best Boosted Model - CatBoost: ')
print(f"Best CatBoost Hyperparameters: {boosted_cb_grid_search.best_params_}")
Best Boosted Model - CatBoost: 
Best CatBoost Hyperparameters: {'boosted_cb_model__iterations': 100, 'boosted_cb_model__learning_rate': 0.1, 'boosted_cb_model__max_depth': 6, 'boosted_cb_model__num_leaves': 8}
In [202]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the training and cross-validated data
# to assess overfitting optimism
##################################
print(f"F1 Score on Cross-Validated Data: {boosted_cb_optimal_f1_cv:.4f}")
print(f"F1 Score on Training Data: {boosted_cb_optimal_f1_train:.4f}")
print("\nClassification Report on Train Data:\n", classification_report(y_train_encoded, boosted_cb_optimal.predict(X_train)))
F1 Score on Cross-Validated Data: 0.9295
F1 Score on Training Data: 0.9958

Classification Report on Train Data:
               precision    recall  f1-score   support

           0       1.00      0.99      1.00       200
           1       0.99      1.00      1.00       119

    accuracy                           1.00       319
   macro avg       1.00      1.00      1.00       319
weighted avg       1.00      1.00      1.00       319

In [203]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the train data
##################################
cm_raw = confusion_matrix(y_train_encoded, boosted_cb_optimal.predict(X_train))
cm_normalized = confusion_matrix(y_train_encoded, boosted_cb_optimal.predict(X_train), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Train Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices of the optimal CatBoost model on the train data]
In [204]:
##################################
# Summarizing the F1 score results
# and classification metrics
# on the validation data
# to assess overfitting optimism
##################################
print(f"F1 Score on Validation Data: {boosted_cb_optimal_f1_validation:.4f}")
print("\nClassification Report on Validation Data:\n", classification_report(y_validation_encoded, boosted_cb_optimal.predict(X_validation)))
F1 Score on Validation Data: 0.9487

Classification Report on Validation Data:
               precision    recall  f1-score   support

           0       0.96      0.99      0.97        67
           1       0.97      0.93      0.95        40

    accuracy                           0.96       107
   macro avg       0.97      0.96      0.96       107
weighted avg       0.96      0.96      0.96       107

In [205]:
##################################
# Formulating the raw and normalized
# confusion matrices
# from the validation data
##################################
cm_raw = confusion_matrix(y_validation_encoded, boosted_cb_optimal.predict(X_validation))
cm_normalized = confusion_matrix(y_validation_encoded, boosted_cb_optimal.predict(X_validation), normalize='true')
fig, ax = plt.subplots(1, 2, figsize=(17, 8))
sns.heatmap(cm_raw, annot=True, fmt='d', cmap='Blues', ax=ax[0])
ax[0].set_title('Raw Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[0].set_xlabel('Predicted')
ax[0].set_ylabel('Actual')
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues', ax=ax[1])
ax[1].set_title('Normalized Confusion Matrix: Optimal CatBoost Validation Performance', fontsize=11)
ax[1].set_xlabel('Predicted')
ax[1].set_ylabel('Actual')
plt.tight_layout()
plt.show()
[Figure: raw and normalized confusion matrices of the optimal CatBoost model on the validation data]
In [206]:
##################################
# Gathering the model evaluation metrics
# for the train data
##################################
boosted_cb_optimal_train = model_performance_evaluation(y_train_encoded, boosted_cb_optimal.predict(X_train))
boosted_cb_optimal_train['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_train['set'] = ['train'] * 5
print('Optimal CatBoost Train Performance Metrics: ')
display(boosted_cb_optimal_train)
Optimal CatBoost Train Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.996865 boosted_cb_optimal train
1 Precision 0.991667 boosted_cb_optimal train
2 Recall 1.000000 boosted_cb_optimal train
3 F1 0.995816 boosted_cb_optimal train
4 AUROC 0.997500 boosted_cb_optimal train
In [207]:
##################################
# Gathering the model evaluation metrics
# for the validation data
##################################
boosted_cb_optimal_validation = model_performance_evaluation(y_validation_encoded, boosted_cb_optimal.predict(X_validation))
boosted_cb_optimal_validation['model'] = ['boosted_cb_optimal'] * 5
boosted_cb_optimal_validation['set'] = ['validation'] * 5
print('Optimal CatBoost Validation Performance Metrics: ')
display(boosted_cb_optimal_validation)
Optimal CatBoost Validation Performance Metrics: 
metric_name metric_value model set
0 Accuracy 0.962617 boosted_cb_optimal validation
1 Precision 0.973684 boosted_cb_optimal validation
2 Recall 0.925000 boosted_cb_optimal validation
3 F1 0.948718 boosted_cb_optimal validation
4 AUROC 0.955037 boosted_cb_optimal validation
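The validation metrics above can be cross-checked by hand. The sketch below assumes confusion-matrix counts of TP=37, FP=1, FN=3 and TN=66. These counts are not printed in the notebook; they are inferred from the reported precision, recall and class supports (40 malignant, 67 benign), so they should be read as a reconstruction.

```python
# Confusion-matrix counts inferred from the validation report
# (assumption: reconstructed from the reported metrics, not printed in the notebook)
tp, fp, fn, tn = 37, 1, 3, 66

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # true positive rate
specificity = tn / (tn + fp)      # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it reduces to the mean of recall and specificity
auroc_from_labels = (recall + specificity) / 2

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1),
                    ("AUROC (hard labels)", auroc_from_labels)]:
    print(f"{name}: {value:.6f}")
```

All five values agree with the table to six decimals. The AUROC match is consistent with the metric being computed from predicted labels (the helper is called with `predict` output), in which case it coincides with balanced accuracy; scoring with `predict_proba` outputs would generally give a different value.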
In [208]:
##################################
# Saving the best individual model
# developed from the train data
################################## 
joblib.dump(boosted_cb_optimal, 
            os.path.join("..", MODELS_PATH, "boosted_model_catboost_optimal.pkl"))
Out[208]:
['..\\models\\boosted_model_catboost_optimal.pkl']

1.8. Model Selection

  1. Among the 6 candidate models, the CatBoost (categorical boosting) model was selected as the final model because it achieved the highest F1 score on the independent validation data together with the smallest gap between apparent and validation performance:
    • Apparent F1 Score Performance = 0.9958
    • Independent Validation F1 Score Performance = 0.9487
  2. The final model also maintained a high F1 score on the test data, although it was not the top-scoring model on that set:
    • Independent Test F1 Score Performance = 0.9550
  3. The final model configuration is described as follows:
    • catboost with optimal hyperparameters:
      • learning_rate = 0.1
      • max_depth = 6
      • num_leaves = 8
      • iterations = 100
In [209]:
##################################
# Consolidating all the
# bagged and boosted
# model performance measures
# for the train and validation data
##################################
ensemble_train_validation_all_performance = pd.concat([bagged_rf_optimal_train,
                                             bagged_rf_optimal_validation,                                            
                                             boosted_ab_optimal_train,
                                             boosted_ab_optimal_validation,
                                             boosted_gb_optimal_train,
                                             boosted_gb_optimal_validation,
                                             boosted_xgb_optimal_train,
                                             boosted_xgb_optimal_validation,
                                             boosted_lgbm_optimal_train,
                                             boosted_lgbm_optimal_validation,
                                             boosted_cb_optimal_train,
                                             boosted_cb_optimal_validation], 
                                            ignore_index=True)
print('Consolidated Ensemble Model Performance on Train and Validation Data: ')
display(ensemble_train_validation_all_performance)
Consolidated Ensemble Model Performance on Train and Validation Data: 
metric_name metric_value model set
0 Accuracy 0.974922 bagged_rf_optimal train
1 Precision 0.974359 bagged_rf_optimal train
2 Recall 0.957983 bagged_rf_optimal train
3 F1 0.966102 bagged_rf_optimal train
4 AUROC 0.971492 bagged_rf_optimal train
5 Accuracy 0.934579 bagged_rf_optimal validation
6 Precision 0.971429 bagged_rf_optimal validation
7 Recall 0.850000 bagged_rf_optimal validation
8 F1 0.906667 bagged_rf_optimal validation
9 AUROC 0.917537 bagged_rf_optimal validation
10 Accuracy 0.993730 boosted_ab_optimal train
11 Precision 1.000000 boosted_ab_optimal train
12 Recall 0.983193 boosted_ab_optimal train
13 F1 0.991525 boosted_ab_optimal train
14 AUROC 0.991597 boosted_ab_optimal train
15 Accuracy 0.943925 boosted_ab_optimal validation
16 Precision 0.972222 boosted_ab_optimal validation
17 Recall 0.875000 boosted_ab_optimal validation
18 F1 0.921053 boosted_ab_optimal validation
19 AUROC 0.930037 boosted_ab_optimal validation
20 Accuracy 1.000000 boosted_gb_optimal train
21 Precision 1.000000 boosted_gb_optimal train
22 Recall 1.000000 boosted_gb_optimal train
23 F1 1.000000 boosted_gb_optimal train
24 AUROC 1.000000 boosted_gb_optimal train
25 Accuracy 0.934579 boosted_gb_optimal validation
26 Precision 0.971429 boosted_gb_optimal validation
27 Recall 0.850000 boosted_gb_optimal validation
28 F1 0.906667 boosted_gb_optimal validation
29 AUROC 0.917537 boosted_gb_optimal validation
30 Accuracy 1.000000 boosted_xgb_optimal train
31 Precision 1.000000 boosted_xgb_optimal train
32 Recall 1.000000 boosted_xgb_optimal train
33 F1 1.000000 boosted_xgb_optimal train
34 AUROC 1.000000 boosted_xgb_optimal train
35 Accuracy 0.943925 boosted_xgb_optimal validation
36 Precision 0.972222 boosted_xgb_optimal validation
37 Recall 0.875000 boosted_xgb_optimal validation
38 F1 0.921053 boosted_xgb_optimal validation
39 AUROC 0.930037 boosted_xgb_optimal validation
40 Accuracy 1.000000 boosted_lgbm_optimal train
41 Precision 1.000000 boosted_lgbm_optimal train
42 Recall 1.000000 boosted_lgbm_optimal train
43 F1 1.000000 boosted_lgbm_optimal train
44 AUROC 1.000000 boosted_lgbm_optimal train
45 Accuracy 0.953271 boosted_lgbm_optimal validation
46 Precision 0.972973 boosted_lgbm_optimal validation
47 Recall 0.900000 boosted_lgbm_optimal validation
48 F1 0.935065 boosted_lgbm_optimal validation
49 AUROC 0.942537 boosted_lgbm_optimal validation
50 Accuracy 0.996865 boosted_cb_optimal train
51 Precision 0.991667 boosted_cb_optimal train
52 Recall 1.000000 boosted_cb_optimal train
53 F1 0.995816 boosted_cb_optimal train
54 AUROC 0.997500 boosted_cb_optimal train
55 Accuracy 0.962617 boosted_cb_optimal validation
56 Precision 0.973684 boosted_cb_optimal validation
57 Recall 0.925000 boosted_cb_optimal validation
58 F1 0.948718 boosted_cb_optimal validation
59 AUROC 0.955037 boosted_cb_optimal validation
In [210]:
##################################
# Consolidating all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1 = ensemble_train_validation_all_performance[ensemble_train_validation_all_performance['metric_name']=='F1']
ensemble_train_validation_all_performance_F1_train = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_train_validation_all_performance_F1_validation = ensemble_train_validation_all_performance_F1[ensemble_train_validation_all_performance_F1['set']=='validation'].loc[:,"metric_value"]
In [211]:
##################################
# Combining all the F1 score
# model performance measures
# between the train and validation data
##################################
ensemble_train_validation_all_performance_F1_plot = pd.DataFrame({'train': ensemble_train_validation_all_performance_F1_train.values,
                                                              'validation': ensemble_train_validation_all_performance_F1_validation.values},
                                                             index=ensemble_train_validation_all_performance_F1['model'].unique())
ensemble_train_validation_all_performance_F1_plot
Out[211]:
train validation
bagged_rf_optimal 0.966102 0.906667
boosted_ab_optimal 0.991525 0.921053
boosted_gb_optimal 1.000000 0.906667
boosted_xgb_optimal 1.000000 0.921053
boosted_lgbm_optimal 1.000000 0.935065
boosted_cb_optimal 0.995816 0.948718
In [212]:
##################################
# Plotting all the F1 score
# model performance measures
# between the train and validation sets
##################################
ensemble_train_validation_all_performance_F1_plot = ensemble_train_validation_all_performance_F1_plot.plot.barh(figsize=(10, 7), width=0.9)
ensemble_train_validation_all_performance_F1_plot.set_xlim(0.00,1.00)
ensemble_train_validation_all_performance_F1_plot.set_title("Model Comparison by F1 Score Performance on Train and Validation Data")
ensemble_train_validation_all_performance_F1_plot.set_xlabel("F1 Score Performance")
ensemble_train_validation_all_performance_F1_plot.set_ylabel("Ensemble Model")
ensemble_train_validation_all_performance_F1_plot.grid(False)
ensemble_train_validation_all_performance_F1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ensemble_train_validation_all_performance_F1_plot.containers:
    ensemble_train_validation_all_performance_F1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
[Figure: horizontal bar chart comparing F1 scores on the train and validation data across the six ensemble models]
In [213]:
##################################
# Gathering all model performance measures
# for the validation data
##################################
ensemble_train_validation_all_performance_Accuracy_validation = ensemble_train_validation_all_performance[(ensemble_train_validation_all_performance['set']=='validation') & (ensemble_train_validation_all_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_train_validation_all_performance_Precision_validation = ensemble_train_validation_all_performance[(ensemble_train_validation_all_performance['set']=='validation') & (ensemble_train_validation_all_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_train_validation_all_performance_Recall_validation = ensemble_train_validation_all_performance[(ensemble_train_validation_all_performance['set']=='validation') & (ensemble_train_validation_all_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_train_validation_all_performance_F1_validation = ensemble_train_validation_all_performance[(ensemble_train_validation_all_performance['set']=='validation') & (ensemble_train_validation_all_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_train_validation_all_performance_AUROC_validation = ensemble_train_validation_all_performance[(ensemble_train_validation_all_performance['set']=='validation') & (ensemble_train_validation_all_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
In [214]:
##################################
# Combining all the model performance measures
# for the validation data
##################################
ensemble_train_validation_all_performance_all_plot_validation = pd.DataFrame({'accuracy': ensemble_train_validation_all_performance_Accuracy_validation.values,
                                                                    'precision': ensemble_train_validation_all_performance_Precision_validation.values,
                                                                    'recall': ensemble_train_validation_all_performance_Recall_validation.values,
                                                                    'f1': ensemble_train_validation_all_performance_F1_validation.values,
                                                                    'auroc': ensemble_train_validation_all_performance_AUROC_validation.values},
                                                                   index=ensemble_train_validation_all_performance['model'].unique())
ensemble_train_validation_all_performance_all_plot_validation
Out[214]:
accuracy precision recall f1 auroc
bagged_rf_optimal 0.934579 0.971429 0.850 0.906667 0.917537
boosted_ab_optimal 0.943925 0.972222 0.875 0.921053 0.930037
boosted_gb_optimal 0.934579 0.971429 0.850 0.906667 0.917537
boosted_xgb_optimal 0.943925 0.972222 0.875 0.921053 0.930037
boosted_lgbm_optimal 0.953271 0.972973 0.900 0.935065 0.942537
boosted_cb_optimal 0.962617 0.973684 0.925 0.948718 0.955037
In [215]:
##################################
# Defining a dictionary of the optimal fitted models
# to be evaluated on the test data
##################################
models = {
    'bagged_rf_optimal': bagged_rf_optimal,
    'boosted_ab_optimal': boosted_ab_optimal,
    'boosted_gb_optimal': boosted_gb_optimal,
    'boosted_xgb_optimal': boosted_xgb_optimal,
    'boosted_lgbm_optimal': boosted_lgbm_optimal,
    'boosted_cb_optimal': boosted_cb_optimal
}
In [216]:
##################################
# Encoding the response variables
# for model testing
##################################
y_test_encoded = y_test.map({'B': 0, 'M': 1})
In [217]:
##################################
# Storing the model evaluation metrics
# for the test data
##################################
ensemble_test_all_performance = []

##################################
# Looping through each model
# and evaluating performance on the test data
##################################
for model_name, model in models.items():
   
    # Evaluating performance
    ensemble_test_all_performance_results = model_performance_evaluation(y_test_encoded, model.predict(X_test))
    
    # Adding metadata columns
    ensemble_test_all_performance_results['model'] = model_name
    ensemble_test_all_performance_results['set'] = 'test'
    
    # Storing result
    ensemble_test_all_performance.append(ensemble_test_all_performance_results)
    
In [218]:
##################################
# Consolidating all model performance measures
# for the test data
##################################
ensemble_test_all_performance = pd.concat(ensemble_test_all_performance, ignore_index=True)
print('Consolidated Ensemble Model Performance on Test Data: ')
display(ensemble_test_all_performance)
Consolidated Ensemble Model Performance on Test Data: 
metric_name metric_value model set
0 Accuracy 0.944056 bagged_rf_optimal test
1 Precision 0.941176 bagged_rf_optimal test
2 Recall 0.905660 bagged_rf_optimal test
3 F1 0.923077 bagged_rf_optimal test
4 AUROC 0.936164 bagged_rf_optimal test
5 Accuracy 0.979021 boosted_ab_optimal test
6 Precision 0.980769 boosted_ab_optimal test
7 Recall 0.962264 boosted_ab_optimal test
8 F1 0.971429 boosted_ab_optimal test
9 AUROC 0.975577 boosted_ab_optimal test
10 Accuracy 0.965035 boosted_gb_optimal test
11 Precision 0.944444 boosted_gb_optimal test
12 Recall 0.962264 boosted_gb_optimal test
13 F1 0.953271 boosted_gb_optimal test
14 AUROC 0.964465 boosted_gb_optimal test
15 Accuracy 0.965035 boosted_xgb_optimal test
16 Precision 0.944444 boosted_xgb_optimal test
17 Recall 0.962264 boosted_xgb_optimal test
18 F1 0.953271 boosted_xgb_optimal test
19 AUROC 0.964465 boosted_xgb_optimal test
20 Accuracy 0.979021 boosted_lgbm_optimal test
21 Precision 0.962963 boosted_lgbm_optimal test
22 Recall 0.981132 boosted_lgbm_optimal test
23 F1 0.971963 boosted_lgbm_optimal test
24 AUROC 0.979455 boosted_lgbm_optimal test
25 Accuracy 0.965035 boosted_cb_optimal test
26 Precision 0.913793 boosted_cb_optimal test
27 Recall 1.000000 boosted_cb_optimal test
28 F1 0.954955 boosted_cb_optimal test
29 AUROC 0.972222 boosted_cb_optimal test
In [219]:
##################################
# Gathering all model performance measures
# for the test data
##################################
ensemble_test_all_performance_Accuracy_test = ensemble_test_all_performance[(ensemble_test_all_performance['set']=='test') & (ensemble_test_all_performance['metric_name']=='Accuracy')].loc[:,"metric_value"]
ensemble_test_all_performance_Precision_test = ensemble_test_all_performance[(ensemble_test_all_performance['set']=='test') & (ensemble_test_all_performance['metric_name']=='Precision')].loc[:,"metric_value"]
ensemble_test_all_performance_Recall_test = ensemble_test_all_performance[(ensemble_test_all_performance['set']=='test') & (ensemble_test_all_performance['metric_name']=='Recall')].loc[:,"metric_value"]
ensemble_test_all_performance_F1_test = ensemble_test_all_performance[(ensemble_test_all_performance['set']=='test') & (ensemble_test_all_performance['metric_name']=='F1')].loc[:,"metric_value"]
ensemble_test_all_performance_AUROC_test = ensemble_test_all_performance[(ensemble_test_all_performance['set']=='test') & (ensemble_test_all_performance['metric_name']=='AUROC')].loc[:,"metric_value"]
In [220]:
##################################
# Combining all the model performance measures
# for the test data
##################################
ensemble_test_all_performance_all_plot_test = pd.DataFrame({'accuracy': ensemble_test_all_performance_Accuracy_test.values,
                                                            'precision': ensemble_test_all_performance_Precision_test.values,
                                                            'recall': ensemble_test_all_performance_Recall_test.values,
                                                            'f1': ensemble_test_all_performance_F1_test.values,
                                                            'auroc': ensemble_test_all_performance_AUROC_test.values},
                                                           index=ensemble_test_all_performance['model'].unique())
ensemble_test_all_performance_all_plot_test
Out[220]:
accuracy precision recall f1 auroc
bagged_rf_optimal 0.944056 0.941176 0.905660 0.923077 0.936164
boosted_ab_optimal 0.979021 0.980769 0.962264 0.971429 0.975577
boosted_gb_optimal 0.965035 0.944444 0.962264 0.953271 0.964465
boosted_xgb_optimal 0.965035 0.944444 0.962264 0.953271 0.964465
boosted_lgbm_optimal 0.979021 0.962963 0.981132 0.971963 0.979455
boosted_cb_optimal 0.965035 0.913793 1.000000 0.954955 0.972222
In [221]:
##################################
# Consolidating all the final
# bagged and boosted
# model performance measures
# for the train, validation and test data
##################################
ensemble_overall_performance = pd.concat([ensemble_train_validation_all_performance, ensemble_test_all_performance], axis=0)
In [222]:
##################################
# Consolidating all the F1 score
# model performance measures
# between the train, validation and test data
##################################
ensemble_overall_performance_F1 = ensemble_overall_performance[ensemble_overall_performance['metric_name']=='F1']
ensemble_overall_performance_F1_train = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='train'].loc[:,"metric_value"]
ensemble_overall_performance_F1_validation = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='validation'].loc[:,"metric_value"]
ensemble_overall_performance_F1_test = ensemble_overall_performance_F1[ensemble_overall_performance_F1['set']=='test'].loc[:,"metric_value"]
In [223]:
##################################
# Combining all the F1 score
# model performance measures
# between the train, validation and test data
##################################
ensemble_overall_performance_F1_plot = pd.DataFrame({'train': ensemble_overall_performance_F1_train.values,
                                                     'validation': ensemble_overall_performance_F1_validation.values,
                                                     'test': ensemble_overall_performance_F1_test.values},
                                                    index=ensemble_overall_performance_F1['model'].unique())
ensemble_overall_performance_F1_plot
Out[223]:
train validation test
bagged_rf_optimal 0.966102 0.906667 0.923077
boosted_ab_optimal 0.991525 0.921053 0.971429
boosted_gb_optimal 1.000000 0.906667 0.953271
boosted_xgb_optimal 1.000000 0.921053 0.953271
boosted_lgbm_optimal 1.000000 0.935065 0.971963
boosted_cb_optimal 0.995816 0.948718 0.954955
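The selection rule used in this section (highest validation F1, with the train-validation gap as an overfitting check) can be sketched directly from the table above. The values are copied from that table; the dictionary layout and variable names are illustrative only.

```python
# F1 scores copied from the consolidated table above: (train, validation, test)
f1_scores = {
    "bagged_rf_optimal":    (0.966102, 0.906667, 0.923077),
    "boosted_ab_optimal":   (0.991525, 0.921053, 0.971429),
    "boosted_gb_optimal":   (1.000000, 0.906667, 0.953271),
    "boosted_xgb_optimal":  (1.000000, 0.921053, 0.953271),
    "boosted_lgbm_optimal": (1.000000, 0.935065, 0.971963),
    "boosted_cb_optimal":   (0.995816, 0.948718, 0.954955),
}

# Selecting on validation F1 only, so the test set plays no part in the choice
selected = max(f1_scores, key=lambda name: f1_scores[name][1])

# Train-minus-validation gap as a rough indicator of overfitting optimism
gaps = {name: round(train - val, 6) for name, (train, val, _) in f1_scores.items()}
smallest_gap = min(gaps, key=gaps.get)

print("Selected by validation F1:", selected)
print("Smallest train-validation gap:", smallest_gap)
```

On these numbers the CatBoost pipeline has both the highest validation F1 and the smallest train-validation gap. Its test F1 (0.954955) is lower than that of LightGBM (0.971963) and AdaBoost (0.971429), which is worth keeping in mind when reading the monitoring results, even though the test set is rightly not used for selection.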
In [224]:
##################################
# Plotting all the F1 score
# model performance measures
# between train, validation and test sets
##################################
ensemble_overall_performance_F1_plot = ensemble_overall_performance_F1_plot.plot.barh(figsize=(10, 8), width=0.9)
ensemble_overall_performance_F1_plot.set_xlim(0.00,1.00)
ensemble_overall_performance_F1_plot.set_title("Model Comparison by F1 Score Performance on Train, Validation and Test Data")
ensemble_overall_performance_F1_plot.set_xlabel("F1 Score Performance")
ensemble_overall_performance_F1_plot.set_ylabel("Ensemble Model")
ensemble_overall_performance_F1_plot.grid(False)
ensemble_overall_performance_F1_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in ensemble_overall_performance_F1_plot.containers:
    ensemble_overall_performance_F1_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
[Figure: horizontal bar chart comparing F1 scores on the train, validation and test data across the six ensemble models]

1.9. Model Monitoring using the NannyML Framework

1.9.1 Simulated Baseline Control

In [225]:
##################################
# Defining the global parameters
# for the post-model deployment scenario simulation
##################################
N_CHUNKS = 10
CHUNK_SIZE = 100
RANDOM_STATE = 987654321
TARGET_COL = 'diagnosis'
LABEL_MAP = {'B': 0, 'M': 1}
FEATURE_COLUMNS = [
'radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
'compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean',
'radius_se','texture_se','perimeter_se','area_se','smoothness_se',
'compactness_se','concavity_se','concave points_se','symmetry_se','fractal_dimension_se',
'radius_worst','texture_worst','perimeter_worst','area_worst','smoothness_worst',
'compactness_worst','concavity_worst','concave points_worst','symmetry_worst','fractal_dimension_worst'
]
In [226]:
##################################
# Creating the monitoring baseline control
# by combining both validation and test data together
##################################
breast_cancer_monitoring_baseline = pd.concat(
    [breast_cancer_validation, breast_cancer_test], 
    axis=0,            
    ignore_index=True
)
In [227]:
##################################
# Defining a function for generating
# a post-model data stream simulation
##################################
def make_stream_from_dataframe(df, n_chunks=N_CHUNKS, chunk_size=CHUNK_SIZE, random_state=RANDOM_STATE):
    """Creates a synthetic ordered stream of chunks, each split roughly 50/50 between the 'M' and 'B' classes."""
    # Initializing a random number generator for reproducibility
    rng = np.random.RandomState(random_state)
    # Initializing an empty list to store each generated chunk
    rows = []

    # Splitting the dataframe into the two classes
    df_M = df[df[TARGET_COL] == "M"]
    df_B = df[df[TARGET_COL] == "B"]

    # Determining roughly balanced counts per chunk
    half_size = chunk_size // 2
    
    # Iterating through the desired number of chunks (simulated time intervals)
    for chunk_idx in range(n_chunks):
        # Sampling half of the chunk from each class (with replacement)
        sample_M = df_M.sample(
            n=half_size, replace=True, random_state=rng.randint(0, 2**31 - 1)
        )
        sample_B = df_B.sample(
            n=chunk_size - half_size, replace=True, random_state=rng.randint(0, 2**31 - 1)
        )

        # Combining, shuffling, and labeling with chunk/time index
        chunk = pd.concat([sample_M, sample_B], ignore_index=True).sample(
            frac=1, random_state=rng.randint(0, 2**31 - 1)
        )
        chunk["__chunk"] = chunk_idx
        chunk["__timestamp"] = chunk_idx

        rows.append(chunk)
    # Combining all chunks into a single DataFrame that represents a continuous data stream
    return pd.concat(rows, ignore_index=True)
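To see what the stream generator produces, the sketch below applies the same sampling logic to a small made-up frame. The toy data, the `toy_` names and the condensed `make_stream` helper are assumptions for illustration; the helper mirrors `make_stream_from_dataframe` above but is not the notebook's function.

```python
import numpy as np
import pandas as pd

def make_stream(df, target_col="diagnosis", n_chunks=3, chunk_size=10, random_state=0):
    """Condensed sketch of the balanced chunk sampler (illustrative only)."""
    rng = np.random.RandomState(random_state)
    df_m = df[df[target_col] == "M"]
    df_b = df[df[target_col] == "B"]
    half = chunk_size // 2
    chunks = []
    for chunk_idx in range(n_chunks):
        # Sampling with replacement so that a small class can still fill its half
        sample_m = df_m.sample(n=half, replace=True,
                               random_state=rng.randint(0, 2**31 - 1))
        sample_b = df_b.sample(n=chunk_size - half, replace=True,
                               random_state=rng.randint(0, 2**31 - 1))
        # Shuffling the combined rows and tagging them with the chunk index
        chunk = pd.concat([sample_m, sample_b], ignore_index=True).sample(
            frac=1, random_state=rng.randint(0, 2**31 - 1)
        )
        chunk["__chunk"] = chunk_idx
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

# Toy data: 6 benign and 2 malignant rows (made-up values)
toy_df = pd.DataFrame({
    "radius_mean": [11.0, 12.5, 13.1, 10.8, 12.0, 11.7, 19.4, 21.3],
    "diagnosis":   ["B", "B", "B", "B", "B", "B", "M", "M"],
})

toy_stream = make_stream(toy_df)

# Each chunk has chunk_size rows, split evenly between the two classes
chunk_class_counts = toy_stream.groupby("__chunk")["diagnosis"].value_counts()
print(chunk_class_counts.unstack())
```

Because every chunk is drawn 50/50, the simulated stream carries a malignant share of about 50%, whereas the training data has 119 malignant cases out of 319 (about 37%). This difference in class prior may matter when interpreting the prior-shift scenario against the baseline.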
In [228]:
##################################
# Defining a function for 
# computing model predictions and probabilities
# using the final selected model - categorical boosting model
##################################
def compute_preds_and_proba(pipeline, X):
    """Returns predicted labels and class 1 probabilities"""
    # Generating predicted class labels (0 or 1) using the trained model pipeline
    y_pred = pipeline.predict(X)
    try:
        # Obtaining the probability of the positive class (class 1)
        y_proba = pipeline.predict_proba(X)[:, 1]
    except AttributeError:
        # Approximating the class-1 probability with a logistic transform of the
        # decision score when predict_proba is unavailable (not a calibrated probability)
        y_proba = 1 / (1 + np.exp(-pipeline.decision_function(X)))
    # Returning both predicted labels and corresponding class-1 probabilities
    return y_pred, y_proba
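The fallback branch can be exercised without a fitted model. The sketch below uses a hypothetical stub class (`MarginOnlyModel`, not part of the notebook or any library) that exposes `predict` and `decision_function` but no `predict_proba`. It restates the helper's logic so the block runs on its own, catching `AttributeError` specifically.

```python
import numpy as np

class MarginOnlyModel:
    """Hypothetical stand-in for an estimator without predict_proba."""

    def decision_function(self, X):
        # Signed score: positive values favor class 1
        return np.asarray(X, dtype=float).sum(axis=1)

    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)

def preds_and_proba(model, X):
    """Mirrors compute_preds_and_proba, restated for a standalone run."""
    y_pred = model.predict(X)
    try:
        y_proba = model.predict_proba(X)[:, 1]
    except AttributeError:
        # Logistic squashing of the raw score; an uncalibrated approximation
        y_proba = 1 / (1 + np.exp(-model.decision_function(X)))
    return y_pred, y_proba

# Three rows whose scores are -2, 0 and +2
X_demo = np.array([[-2.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y_pred, y_proba = preds_and_proba(MarginOnlyModel(), X_demo)
print(y_pred, np.round(y_proba, 3))
```

The logistic transform is monotone in the score, so ranking-based metrics are unaffected, but the resulting values are not calibrated probabilities. `CatBoostClassifier` provides `predict_proba`, so this fallback should not be triggered for the selected model.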
In [229]:
##################################
# Defining a function for 
# simulating the baseline control
##################################
def simulate_P1_baseline(df):
    # Creating a time-ordered synthetic stream of data chunks
    return make_stream_from_dataframe(df)
    
In [230]:
##################################
# Defining a function for 
# plotting chunk-based boxplots for selected features
# for baseline control
##################################
sns.set(style="whitegrid", context="notebook")

def plot_baseline_feature_boxplot(df_base, features, scenario_name="Baseline"):
    """Chunk-based boxplots for selected features in baseline."""
    n_features = len(features)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]
    for ax, f in zip(axes, features):
        sns.boxplot(
            data=df_base,
            x="__chunk", y=f, ax=ax, showfliers=False, color="#4C72B0"
        )
        ax.set_title(f"Chunk-wise {f}: {scenario_name}")
        ax.set_xlabel("Chunk Index (Simulated Time)")
        ax.set_ylabel(f)
        ax.set_xticks(range(10))
    plt.tight_layout()
    plt.show()
In [231]:
##################################
# Defining a function for 
# plotting feature mean per chunk
# for baseline control
##################################
def plot_baseline_feature_mean_line(df_base, features, scenario_name="Baseline"):
    """Plots per-feature mean values over chunks (one chart per feature)."""
    mean_values = df_base.groupby('__chunk')[features].mean()
    
    n_features = len(features)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]

    for ax, f in zip(axes, features):
        sns.lineplot(x=mean_values.index, y=mean_values[f], color="#4C72B0", ax=ax)
        ax.set_title(f"Chunk-wise Mean of {f} ({scenario_name})", fontsize=11)
        ax.set_xlabel("Chunk Index")
        ax.set_ylabel("Mean Value")
        ax.grid(True, alpha=0.3)
        ax.set_xticks(range(10))
    
    plt.tight_layout()
    plt.show()
In [232]:
##################################
# Defining a function for 
# plotting class proportion ('M' vs 'B') across chunks
# for baseline control
##################################
def plot_baseline_class_proportion(df_base, scenario_name="Baseline"):
    """Class proportion ('M' vs 'B') across chunks for baseline."""
    prop = df_base.groupby('__chunk')['diagnosis'].value_counts(normalize=True).unstack().fillna(0)
    fig, ax = plt.subplots(figsize=(14, 3))
    sns.lineplot(data=prop['M'], label="Proportion of 'M'", color="#4C72B0", ax=ax)
    ax.set_title(f"Class Proportion per Chunk: {scenario_name}")
    ax.set_xlabel("Chunk Index")
    ax.set_ylabel("Proportion of 'M'")
    ax.set_ylim(-0.1, 1)
    ax.set_xticks(range(10))
    plt.show()
In [233]:
##################################
# Defining a function for 
# plotting missing fraction per chunk
# for baseline control
##################################
def plot_baseline_missingness(df_base, features, scenario_name="Baseline"):
    """Missing fraction per chunk for selected features, one plot per feature."""
    miss = df_base.groupby('__chunk')[features].apply(lambda x: x.isna().mean())
    
    n_features = len(features)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]

    for ax, f in zip(axes, features):
        sns.lineplot(x=miss.index, y=miss[f], color="#4C72B0", ax=ax)
        ax.set_title(f"Missingness over Time: {f} ({scenario_name})", fontsize=11)
        ax.set_xlabel("Chunk Index")
        ax.set_ylabel("Missing Fraction")
        ax.set_ylim(-0.1, 1)
        ax.set_xticks(range(10))
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

    
In [234]:
##################################
# Simulating post-deployment data drift scenario 1 = baseline control
##################################
p1 = simulate_P1_baseline(breast_cancer_monitoring_baseline)
In [235]:
##################################
# Exploring the simulated baseline control
##################################
display(p1)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 M 17.420 25.56 114.50 948.0 0.10060 0.11460 0.168200 0.065970 0.1308 ... 120.40 1021.0 0.1243 0.17930 0.280300 0.10990 0.1603 0.06818 0 0
1 M 22.270 19.67 152.80 1509.0 0.13260 0.27680 0.426400 0.182300 0.2556 ... 206.80 2360.0 0.1701 0.69970 0.960800 0.29100 0.4055 0.09789 0 0
2 B 11.250 14.78 71.38 390.0 0.08306 0.04458 0.000974 0.002941 0.1773 ... 82.08 492.7 0.1166 0.09794 0.005518 0.01667 0.2815 0.07418 0 0
3 B 12.250 22.44 78.18 466.5 0.08192 0.05200 0.017140 0.012610 0.1544 ... 92.74 622.9 0.1256 0.18040 0.123000 0.06335 0.3100 0.08203 0 0
4 B 10.480 19.86 66.72 337.7 0.10700 0.05971 0.048310 0.030700 0.1737 ... 73.68 402.8 0.1515 0.10260 0.118100 0.06736 0.2883 0.07748 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 B 14.030 21.25 89.79 603.4 0.09070 0.06945 0.014620 0.018960 0.1517 ... 98.27 715.5 0.1287 0.15130 0.062310 0.07963 0.2226 0.07617 9 9
996 B 13.710 18.68 88.73 571.0 0.09916 0.10700 0.053850 0.037830 0.1714 ... 99.43 701.9 0.1425 0.25660 0.193500 0.12840 0.2849 0.09031 9 9
997 B 13.080 15.71 85.63 520.0 0.10750 0.12700 0.045680 0.031100 0.1967 ... 96.09 630.5 0.1312 0.27760 0.189000 0.07283 0.3184 0.08183 9 9
998 B 8.597 18.60 54.09 221.2 0.10740 0.05847 0.000000 0.000000 0.2163 ... 56.65 240.1 0.1347 0.07767 0.000000 0.00000 0.3142 0.08116 9 9
999 M 19.790 25.12 130.40 1192.0 0.10150 0.15890 0.254500 0.114900 0.2202 ... 148.70 1589.0 0.1275 0.38610 0.567300 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns

In [236]:
##################################
# Visualizing feature variability
# for baseline control
##################################
plot_baseline_feature_boxplot(p1, FEATURE_COLUMNS)
[Figure: chunk-wise boxplots of each feature for the baseline control]
In [237]:
##################################
# Visualizing feature mean trends
# for baseline control
##################################
plot_baseline_feature_mean_line(p1, FEATURE_COLUMNS)
[Figure: chunk-wise mean of each feature for the baseline control]
In [238]:
##################################
# Inspecting baseline class balance stability
# for baseline control
##################################
plot_baseline_class_proportion(p1)
[Figure: proportion of class 'M' per chunk for the baseline control]
In [239]:
##################################
# Evaluating baseline missingness
# for baseline control
##################################
plot_baseline_missingness(p1, FEATURE_COLUMNS)
[Figure: missing fraction per chunk for each feature, baseline control]
In [240]:
##################################
# Copying the simulated baseline control
# for use as the reference dataset in drift detection
##################################
p1_univariate_drift_df = p1.copy()
In [241]:
##################################
# Defining a function for fitting
# a drift calculator using the simulated baseline control and
# detecting univariate drift for a given scenario
##################################
def detect_univariate_drift(baseline_df, scenario_df, feature_columns, scenario_name="Scenario"):
    """
    Fits a UnivariateDriftCalculator on baseline data and detects drift on scenario data.
    """

    # Initializing the univariate drift calculator
    univariate_drift_calculator = nml.drift.UnivariateDriftCalculator(
        column_names=feature_columns,
        treat_as_categorical=None,
        continuous_methods=["kolmogorov_smirnov"]
    )

    # Fitting the univariate drift calculator on the baseline control
    univariate_drift_calculator.fit(baseline_df)

    # Detecting univariate drift on the scenario dataset
    results = univariate_drift_calculator.calculate(
        data=scenario_df
    )

    # Summarizing the drift detection results
    summary = results.filter(period="analysis").to_df()
    print(f"Univariate drift visualization generated for {scenario_name}")
    print(summary.head(10))

    return results
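
As a rough illustration of the quantity that the Kolmogorov-Smirnov method reports per chunk, the sketch below compares a reference sample with two chunks using `scipy.stats.ks_2samp`. It is a stand-in on synthetic data with hypothetical variable names, not NannyML's internal implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Synthetic reference sample and two hypothetical chunks
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
chunk_same = rng.normal(loc=0.0, scale=1.0, size=100)
chunk_shifted = rng.normal(loc=2.0, scale=1.0, size=100)

# KS statistic = largest gap between the two empirical CDFs
ks_same = ks_2samp(reference, chunk_same).statistic
ks_shifted = ks_2samp(reference, chunk_shifted).statistic

print(f"same distribution: {ks_same:.3f}")
print(f"shifted by 2 std:  {ks_shifted:.3f}")
```

A chunk drawn from the reference distribution yields a small statistic, while a shifted chunk yields a value much closer to 1.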
    
In [242]:
##################################
# Defining a function for visualizing
# univariate drift for a given scenario
##################################
def plot_univariate_drift_summary(drift_results, feature_columns, scenario_name="Scenario"):
    """
    Visualize KS statistics vs threshold per feature and summarize drift counts.
    """
    # Converting results to a DataFrame
    df = drift_results.to_df().copy()

    # Handling MultiIndex columns
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = ['__'.join(col).strip() if isinstance(col, tuple) else col for col in df.columns]

    # Extracting chunk_index
    chunk_col_candidates = ["chunk__chunk__chunk_index", "chunk_index"]
    for col in chunk_col_candidates:
        if col in df.columns:
            df["chunk_index"] = df[col]
            break
    else:
        if "chunk_index" in df.index.names:
            df = df.reset_index()
        if "chunk_index" not in df.columns:
            raise KeyError("Cannot find 'chunk_index' in drift_results output.")

    # Verifying that the KS value and threshold columns are present
    value_cols = [c for c in df.columns if c.endswith("__kolmogorov_smirnov__value")]
    upper_threshold_cols = [c for c in df.columns if c.endswith("__kolmogorov_smirnov__upper_threshold")]

    if not value_cols or not upper_threshold_cols:
        raise KeyError("Cannot find KS statistic or threshold columns in drift_results output.")

    # Plotting all features row-wise
    n_features = len(feature_columns)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]

    sns.set_style("whitegrid")

    for ax, feature in zip(axes, feature_columns):
        # Finding the corresponding KS column in the dataframe
        ks_col_name = f"{feature}__kolmogorov_smirnov__value"
        thresh_col_name = f"{feature}__kolmogorov_smirnov__upper_threshold"
        if ks_col_name not in df.columns or thresh_col_name not in df.columns:
            print(f"Warning: {feature} not found in drift results. Skipping.")
            continue

        subdf = df[["chunk_index", ks_col_name, thresh_col_name]].copy()
        subdf.columns = ["chunk_index", "statistic", "threshold"]

        sns.lineplot(
            data=subdf,
            x="chunk_index",
            y="statistic",
            color="blue",
            ax=ax,
            label="KS Statistic"
        )
        ax.axhline(
            y=subdf["threshold"].iloc[0],
            color="red",
            linestyle="--",
            label="Threshold"
        )
        ax.set_title(f"{feature} ({scenario_name})", fontsize=10)
        ax.set_ylabel("KS Statistic")
        ax.set_xlabel("Chunk Index")
        ax.legend(loc="upper right", fontsize=8)
        ax.set_xticks(range(10))
        ax.grid(True, alpha=0.3)
        ax.set_ylim(-0.05, 1.05)

    plt.tight_layout()
    plt.show()

    # Formulating the summary table indicating the number of chunks exceeding threshold per feature
    univariate_drift_summary_list = []
    for feature in feature_columns:
        ks_col_name = f"{feature}__kolmogorov_smirnov__value"
        thresh_col_name = f"{feature}__kolmogorov_smirnov__upper_threshold"
        if ks_col_name not in df.columns or thresh_col_name not in df.columns:
            drift_count = 0
        else:
            drift_count = (df[ks_col_name] > df[thresh_col_name]).sum()
        univariate_drift_summary_list.append({"feature": feature, "chunk_drift_count": drift_count})

    univariate_drift_summary = pd.DataFrame(univariate_drift_summary_list)

    print("Univariate Drift Summary Table:")
    display(univariate_drift_summary)

    return univariate_drift_summary
    
In [243]:
##################################
# Detecting univariate drift for baseline control
##################################
univariate_drift_analysis_p1 = detect_univariate_drift(p1, p1, FEATURE_COLUMNS, "Baseline Control")
Univariate drift visualization generated for Baseline Control
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.040        0.150643            None  ...               None   
1              0.089        0.150643            None  ...               None   
2              0.068        0.150643            None  ...               None   
3              0.117        0.150643            None  ...               None   
4              0.113        0.150643            None  ...               None   
5              0.055        0.150643            None  ...               None   
6              0.061        0.150643            None  ...               None   
7              0.060        0.150643            None  ...               None   
8              0.045        0.150643            None  ...               None   
9              0.089        0.150643            None  ...               None   

                 texture_se                                         \
         kolmogorov_smirnov                                          
   alert              value upper_threshold lower_threshold  alert   
0  False              0.066         0.09887            None  False   
1  False              0.061         0.09887            None  False   
2  False              0.085         0.09887            None  False   
3  False              0.054         0.09887            None  False   
4  False              0.061         0.09887            None  False   
5  False              0.036         0.09887            None  False   
6  False              0.068         0.09887            None  False   
7  False              0.060         0.09887            None  False   
8  False              0.073         0.09887            None  False   
9  False              0.064         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.097        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.053        0.140661            None  False  
3              0.076        0.140661            None  False  
4              0.064        0.140661            None  False  
5              0.068        0.140661            None  False  
6              0.125        0.140661            None  False  
7              0.060        0.140661            None  False  
8              0.090        0.140661            None  False  
9              0.091        0.140661            None  False  

[10 rows x 127 columns]
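
The upper thresholds in the output above are consistent with a rule of mean plus three standard deviations applied to the per-chunk KS values of the reference data. The sketch below recomputes the `area_mean` threshold from the ten values printed above (which equal the reference values here, since the baseline is compared with itself). Treating this as the rule NannyML applied is an inference from the numbers, not something stated in the notebook.

```python
import numpy as np

# KS values for area_mean, copied from the printed summary above
ks_area_mean = np.array([0.040, 0.089, 0.068, 0.117, 0.113,
                         0.055, 0.061, 0.060, 0.045, 0.089])

# Mean plus three population standard deviations (ddof=0)
upper_threshold = ks_area_mean.mean() + 3 * ks_area_mean.std()
print(round(upper_threshold, 6))
```

The result agrees with the printed `upper_threshold` for `area_mean` to within rounding; the same calculation on the `texture_se` values gives approximately 0.09887.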
In [244]:
##################################
# Visualizing univariate drift for baseline control
##################################
univariate_drift_analysis_visualization_p1 = plot_univariate_drift_summary(univariate_drift_analysis_p1, FEATURE_COLUMNS, "Baseline Control")
[Figure: KS statistic versus threshold per chunk for each feature, baseline control]
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 0
1 texture_mean 0
2 perimeter_mean 0
3 area_mean 0
4 smoothness_mean 0
5 compactness_mean 0
6 concavity_mean 0
7 concave points_mean 0
8 symmetry_mean 0
9 fractal_dimension_mean 0
10 radius_se 0
11 texture_se 0
12 perimeter_se 0
13 area_se 0
14 smoothness_se 0
15 compactness_se 0
16 concavity_se 0
17 concave points_se 0
18 symmetry_se 0
19 fractal_dimension_se 0
20 radius_worst 0
21 texture_worst 0
22 perimeter_worst 0
23 area_worst 0
24 smoothness_worst 0
25 compactness_worst 0
26 concavity_worst 0
27 concave points_worst 0
28 symmetry_worst 0
29 fractal_dimension_worst 0
In [245]:
##################################
# Defining a function for fitting
# a CBPE estimator using the simulated baseline control and
# estimating CBPE performance per chunk for a given scenario
##################################
def estimate_chunk_cbpe_performance(reference_df, target_df, model_pipeline, feature_columns, target_col='diagnosis', label_map={'B':0, 'M':1}, chunk_col='__chunk'):
    """
    Fits a CBPE estimator on baseline data and estimates performance per chunk on scenario data.
    """

    # Preparing the reference data
    X_ref = reference_df[feature_columns]
    y_ref = reference_df[target_col].map(label_map)
    y_pred_ref, y_proba_ref = compute_preds_and_proba(model_pipeline, X_ref)

    ref_df = reference_df.copy()
    ref_df['y_true'] = y_ref
    ref_df['y_pred'] = y_pred_ref
    ref_df['y_pred_proba'] = y_proba_ref

    # Defining a chunker
    chunker = DefaultChunker()

    # Fitting CBPE on the reference data
    cbpe_estimator = CBPE(
        y_true='y_true',
        y_pred_proba='y_pred_proba',
        y_pred='y_pred',
        metrics=['roc_auc'],
        problem_type='classification_binary',
        chunker=chunker
    )
    cbpe_estimator.fit(ref_df)

    # Preparing the scenario data
    X_target = target_df[feature_columns]
    y_pred_target, y_proba_target = compute_preds_and_proba(model_pipeline, X_target)

    target_df_copy = target_df.copy()
    target_df_copy['y_pred'] = y_pred_target
    target_df_copy['y_pred_proba'] = y_proba_target

    # Estimating CBPE performance per chunk on the scenario data
    perf_results = cbpe_estimator.estimate(target_df_copy)

    chunk_cbpe_performance_summary = perf_results.to_df()

    print("Chunk CBPE Performance Summary Table:")
    display(chunk_cbpe_performance_summary)

    return chunk_cbpe_performance_summary
    
In [246]:
##################################
# Defining a function for visualizing
# CBPE performance for a given scenario
##################################
def plot_chunk_cbpe_performance(performance_df, baseline_name="Baseline", scenario_name="Scenario"):
    """
    Visualize CBPE-estimated ROC-AUC evolution per chunk for both reference and analysis periods,
    and summarize performance degradation alerts.
    """

    # Flattening the MultiIndex columns
    df = performance_df.copy()
    df.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in df.columns]
    
    # Ensure expected columns exist
    required_cols = [
        'chunk_chunk_index', 'chunk_period', 'roc_auc_value',
        'roc_auc_lower_confidence_boundary', 'roc_auc_upper_confidence_boundary', 'roc_auc_alert'
    ]
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    
    # Splitting results for reference and analysis scenarios
    df_ref = df[df['chunk_period'] == 'reference']
    df_analysis = df[df['chunk_period'] == 'analysis']

    # Using the reference confidence boundaries for both plots
    ref_bounds = df_ref[['chunk_chunk_index', 'roc_auc_lower_confidence_boundary', 'roc_auc_upper_confidence_boundary']]
    df_analysis = pd.merge(
        df_analysis.drop(columns=['roc_auc_lower_confidence_boundary', 'roc_auc_upper_confidence_boundary']),
        ref_bounds,
        on='chunk_chunk_index',
        how='left'
    )
    
    # Create a two-row plot
    fig, axes = plt.subplots(2, 1, figsize=(12, 7), sharex=True)
    sns.set_style("whitegrid")
    
    # Generating a helper function for consistent plotting
    def plot_cbpe_line(sub_df, ax, color, title):
        # Plotting the estimated performance
        sns.lineplot(
            data=sub_df,
            x='chunk_chunk_index',
            y='roc_auc_value',
            color=color,
            marker='o',
            ax=ax,
            label='Estimated ROC-AUC'
        )
    
        # Plotting the confidence region
        ax.fill_between(
            sub_df['chunk_chunk_index'],
            sub_df['roc_auc_lower_confidence_boundary'],
            sub_df['roc_auc_upper_confidence_boundary'],
            color=color,
            alpha=0.15
        )
    
        # Plotting the confidence boundary lines
        sns.lineplot(
            data=sub_df,
            x='chunk_chunk_index',
            y='roc_auc_upper_confidence_boundary',
            color='black',
            linestyle='-',
            ax=ax,
            label='Upper Confidence Bound'
        )
        sns.lineplot(
            data=sub_df,
            x='chunk_chunk_index',
            y='roc_auc_lower_confidence_boundary',
            color='red',
            linestyle='--',
            ax=ax,
            label='Lower Confidence Bound'
        )
    
        ax.set_title(title, fontsize=12)
        ax.set_xlabel("Chunk Index (Simulated Time)")
        ax.set_ylabel("CBPE-Estimated ROC-AUC")
        ax.set_ylim(0.8, 1.01)
        ax.set_yticks(np.arange(0.8, 1.01, 0.05))
        ax.set_xticks(range(10))
        ax.grid(True, alpha=0.3)
        ax.legend(loc='lower right', fontsize=8)
    
    # Plotting the reference CBPE ROC-AUC estimates
    plot_cbpe_line(df_ref, axes[0], color='blue', title=f"{baseline_name} (Reference Period)")
    
    # Plotting the scenario CBPE ROC-AUC estimates
    plot_cbpe_line(df_analysis, axes[1], color='orange', title=f"{scenario_name} (Analysis Period)")
    
    plt.tight_layout()
    plt.show()
    
    # Formulating the summary table indicating the number of ROC-AUC alerts per chunk
    chunk_cbpe_performance_summary = (
        df.groupby(['chunk_chunk_index', 'chunk_period'])['roc_auc_alert']
        .sum()
        .reset_index()
        .rename(columns={'roc_auc_alert': 'cbpe_roc_auc_alert_count'})
    )
    
    print("Chunk CBPE Performance Summary Table:")
    display(chunk_cbpe_performance_summary)
    
    return chunk_cbpe_performance_summary
In [247]:
##################################
# Estimating CBPE performance for baseline control
##################################
chunk_cbpe_performance_analysis_p1 = estimate_chunk_cbpe_performance(p1, p1, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.0 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.0 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.0 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.0 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.0 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.0 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.0 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.0 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.0 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.0 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.994438 0.00367 NaN 1.0 0.983427 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.997225 0.00367 NaN 1.0 0.986213 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.997486 0.00367 NaN 1.0 0.986475 1 0.976509 False
13 [300:399] 3 300 399 None None analysis 0.995054 0.00367 NaN 1.0 0.984043 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.996828 0.00367 NaN 1.0 0.985817 1 0.976509 False
15 [500:599] 5 500 599 None None analysis 0.995083 0.00367 NaN 1.0 0.984072 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.996123 0.00367 NaN 1.0 0.985111 1 0.976509 False
17 [700:799] 7 700 799 None None analysis 0.995848 0.00367 NaN 1.0 0.984836 1 0.976509 False
18 [800:899] 8 800 899 None None analysis 0.991864 0.00367 NaN 1.0 0.980853 1 0.976509 False
19 [900:999] 9 900 999 None None analysis 0.994634 0.00367 NaN 1.0 0.983623 1 0.976509 False
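
The idea behind confidence-based performance estimation can be sketched for the simpler case of accuracy: if the predicted probabilities are well calibrated, each prediction carries its own probability of being correct, so performance can be estimated without labels. The snippet below is a simplified illustration on synthetic data. It is not NannyML's CBPE implementation and does not cover the ROC-AUC estimate reported above. As a side observation, the lower confidence boundaries in the table equal the estimated value minus three times the sampling error (for example, 0.994438 - 3 * 0.00367 = 0.983428, matching the printed 0.983427 up to rounding), with the upper boundary capped at 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, perfectly calibrated probabilities for the positive class
y_proba = rng.uniform(0.0, 1.0, size=20_000)
# Labels drawn so that P(y = 1) equals the predicted probability
y_true = (rng.uniform(0.0, 1.0, size=y_proba.size) < y_proba).astype(int)
y_pred = (y_proba >= 0.5).astype(int)

# Expected correctness of each prediction, computed without labels
p_correct = np.where(y_pred == 1, y_proba, 1.0 - y_proba)
estimated_accuracy = p_correct.mean()

# Realized accuracy, which needs labels, for comparison
realized_accuracy = (y_pred == y_true).mean()

print(f"estimated: {estimated_accuracy:.3f}")
print(f"realized:  {realized_accuracy:.3f}")
```

When calibration holds, the two numbers agree closely; a gap between them on real data would suggest miscalibration or concept drift, which label-free estimation cannot see.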
In [248]:
##################################
# Visualizing CBPE performance for baseline control
##################################
chunk_cbpe_performance_analysis_visualization_p1 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p1, baseline_name="Baseline Control", scenario_name="Baseline Control")
[Figure: CBPE-estimated ROC-AUC per chunk for the reference and analysis periods, baseline control]
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 0
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 0
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 0
13 6 reference 0
14 7 analysis 0
15 7 reference 0
16 8 analysis 0
17 8 reference 0
18 9 analysis 0
19 9 reference 0

1.9.2 Simulated Covariate Drift

In [249]:
##################################
# Defining the covariate drift-specific parameters
# for the post-model deployment scenario simulation
##################################
COVARIATE_DRIFT_FEATURES = ['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
                            'compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean']
COVARIATE_DRIFT_DELTA = 0.5
COVARIATE_DRIFT_SCALE = 3.5
COVARIATE_DRIFT_RAMP = 15
In [250]:
##################################
# Defining a function for 
# simulating covariate drift
##################################
def simulate_P2_covariate_drift(df):
    # Creating a time-ordered synthetic stream of data chunks
    stream = make_stream_from_dataframe(df)
    # Computing standard deviations of selected features to scale drift magnitudes appropriately
    stds = df[COVARIATE_DRIFT_FEATURES].std()
    # Looping through each simulated chunk (time step)
    for chunk_idx in range(N_CHUNKS):
        # Computing the drift ramp fraction, which rises linearly from 1/RAMP and is capped at 1
        frac = min(1, (chunk_idx+1)/COVARIATE_DRIFT_RAMP)
        # Applying a Boolean mask to isolate current chunk’s samples
        mask = stream['__chunk'] == chunk_idx
        # Applying drift to each feature selected for covariate drift
        for f in COVARIATE_DRIFT_FEATURES:
            # Computing the additive mean shift, proportional to the feature's standard deviation and the drift fraction
            add = COVARIATE_DRIFT_DELTA * stds[f] * frac
            # Computing the multiplicative scale factor, which grows with the drift fraction
            scale = 1 + (COVARIATE_DRIFT_SCALE - 1) * frac
            # Applying both the scale and mean shifts to the current chunk's feature values
            stream.loc[mask, f] = stream.loc[mask, f] * scale + add
    # Returning the modified data stream containing simulated covariate drift
    return stream
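
To make the ramp concrete, the sketch below tabulates the drift fraction, the multiplicative scale, and the additive shift (in units of each feature's standard deviation) for every chunk, using the parameter values defined above. It assumes `N_CHUNKS = 10`, which matches the chunk indices 0 to 9 in the displayed outputs but is defined outside this section.

```python
import pandas as pd

COVARIATE_DRIFT_DELTA = 0.5
COVARIATE_DRIFT_SCALE = 3.5
COVARIATE_DRIFT_RAMP = 15
N_CHUNKS = 10  # assumption: defined earlier in the notebook

rows = []
for chunk_idx in range(N_CHUNKS):
    # Same ramp formula as in simulate_P2_covariate_drift
    frac = min(1, (chunk_idx + 1) / COVARIATE_DRIFT_RAMP)
    rows.append({
        "chunk": chunk_idx,
        "frac": frac,
        "scale": 1 + (COVARIATE_DRIFT_SCALE - 1) * frac,
        "add_in_stds": COVARIATE_DRIFT_DELTA * frac,
    })

ramp = pd.DataFrame(rows)
print(ramp.round(4))
```

Because the ramp length (15) exceeds the number of chunks, the drift never reaches full strength: the last chunk is scaled by about 2.67 rather than 3.5, and even the first chunk is already scaled by about 1.17.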
In [251]:
##################################
# Defining a function for 
# visualizing the boxplot comparison chart
# for both the simulated and baseline control
##################################
def plot_feature_boxplot_comparison(df_base, df_drift, features, scenario_name):
    """Chunk-based boxplots for selected features for Baseline vs Scenario."""
    # Resetting indices to avoid duplicate label issues
    df_base = df_base.reset_index(drop=True) 
    df_drift = df_drift.reset_index(drop=True)
    # Determining the number of features to plot
    n_features = len(features)
    # Creating a vertically stacked subplot layout (one plot per feature)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    # Ensuring axes is iterable even if there’s only one feature
    if n_features == 1:
        axes = [axes]
    # Iterating through each feature and its corresponding subplot axis
    for ax, f in zip(axes, features):
        # Combining baseline and scenario data, labeled by scenario, for side-by-side comparison
        combined_df = pd.concat(
            [
                df_base.assign(scenario='Baseline Control'),
                df_drift.assign(scenario=scenario_name),
            ],
            ignore_index=True,
        ).dropna(subset=[f, "__chunk"])
        # Creating a boxplot showing the distribution of the feature across chunks
        sns.boxplot(
            data=combined_df,
            x="__chunk", y=f, hue="scenario", ax=ax, showfliers=False
        )
        y_min = combined_df[f].min() 
        y_max = combined_df[f].max() 
        y_extension = 0.2 * (y_max - y_min) 
        ax.set_ylim(y_min - y_extension, y_max + y_extension)
        ax.set_title(f"Chunk-wise {f}: {scenario_name} vs Baseline Control")
        ax.set_xlabel("Chunk Index (Simulated Time)")
        ax.set_ylabel(f)
        ax.legend(loc='upper left', bbox_to_anchor=(0, 1))
        ax.set_xticks(range(10))
    plt.tight_layout()
    plt.show()
    
In [252]:
##################################
# Defining a function for 
# visualizing the mean line comparison chart
# for both the simulated and baseline control
##################################
def plot_feature_mean_line(df_base, df_drift, features, scenario_name):
    """Plots per-feature mean values over chunks (one chart per feature) for Baseline vs Scenario."""
    # Computing the chunk-wise mean per feature for both datasets
    base_means = df_base.groupby('__chunk')[features].mean().assign(scenario='Baseline Control')
    drift_means = df_drift.groupby('__chunk')[features].mean().assign(scenario=scenario_name)
    combined = pd.concat([base_means, drift_means])
    melted = combined.reset_index().melt(
        id_vars=['__chunk', 'scenario'],
        var_name='feature',
        value_name='mean_value'
    )

    # Preparing the subplots (one row per feature)
    n_features = len(features)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]

    # Plotting the lineplots for each feature
    for ax, f in zip(axes, features):
        subset = melted[melted['feature'] == f]
        sns.lineplot(
            data=subset,
            x='__chunk',
            y='mean_value',
            hue='scenario',
            ax=ax
        )
        ax.set_title(f"Chunk-wise Mean of {f}: {scenario_name} vs Baseline", fontsize=11)
        ax.set_xlabel("Chunk Index")
        ax.set_ylabel("Mean Value")
        ax.grid(True, alpha=0.3)
        ax.set_xticks(range(10))
        ax.legend(loc='best')

    plt.tight_layout()
    plt.show()
In [253]:
##################################
# Simulating post-deployment data drift scenario 2 = covariate drift
##################################
p2 = simulate_P2_covariate_drift(breast_cancer_monitoring_baseline)
In [254]:
##################################
# Exploring the simulated covariate drift
##################################
display(p2)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 M 20.440009 29.953748 134.385875 1117.567389 0.117863 0.135572 0.198945 0.078271 0.153581 ... 120.40 1021.0 0.1243 0.17930 0.280300 0.10990 0.1603 0.06818 0 0
1 M 26.098342 23.082082 179.069208 1772.067389 0.155196 0.324805 0.500178 0.213990 0.299181 ... 206.80 2360.0 0.1701 0.69970 0.960800 0.29100 0.4055 0.09789 0 0
2 B 13.241676 17.377082 84.079208 466.567389 0.097399 0.053882 0.003847 0.004737 0.207831 ... 82.08 492.7 0.1166 0.09794 0.005518 0.01667 0.2815 0.07418 0 0
3 B 14.408342 26.313748 92.012542 555.817389 0.096069 0.062539 0.022708 0.016018 0.181114 ... 92.74 622.9 0.1256 0.18040 0.123000 0.06335 0.3100 0.08203 0 0
4 B 12.343342 23.303748 78.642542 405.550723 0.125329 0.071534 0.059073 0.037123 0.203631 ... 73.68 402.8 0.1515 0.10260 0.118100 0.06736 0.2883 0.07748 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 B 38.580091 58.004150 247.465418 1724.740559 0.246825 0.203922 0.066099 0.063622 0.414339 ... 98.27 715.5 0.1287 0.15130 0.062310 0.07963 0.2226 0.07617 9 9
996 B 37.726758 51.150816 244.638751 1638.340559 0.269385 0.304055 0.170712 0.113942 0.466872 ... 99.43 701.9 0.1425 0.25660 0.193500 0.12840 0.2849 0.09031 9 9
997 B 36.046758 43.230816 236.372084 1502.340559 0.291625 0.357388 0.148925 0.095996 0.534339 ... 96.09 630.5 0.1312 0.27760 0.189000 0.07283 0.3184 0.08183 9 9
998 B 24.092091 50.937483 152.265418 705.540559 0.291359 0.174642 0.027112 0.013062 0.586606 ... 56.65 240.1 0.1347 0.07767 0.000000 0.00000 0.3142 0.08116 9 9
999 M 53.940091 68.324150 355.758751 3294.340559 0.275625 0.442455 0.705779 0.319462 0.597006 ... 148.70 1589.0 0.1275 0.38610 0.567300 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns

In [255]:
##################################
# Visualizing baseline feature variability
# for the simulated covariate drift scenario
# and baseline control
##################################
plot_feature_boxplot_comparison(p1, p2, COVARIATE_DRIFT_FEATURES, "Covariate Drift")
[Figure: per-feature boxplot comparison, Covariate Drift vs Baseline Control]
In [256]:
##################################
# Visualizing chunk-wise feature means
# for the simulated covariate drift scenario
# and baseline control
##################################
plot_feature_mean_line(p1, p2, COVARIATE_DRIFT_FEATURES, "Covariate Drift")
[Figure: chunk-wise mean of each covariate drift feature, Covariate Drift vs Baseline Control]
In [257]:
##################################
# Inspecting feature distributions by class label
# for the simulated covariate drift scenario
# and baseline control
##################################
for feat in COVARIATE_DRIFT_FEATURES:
    fig, ax = plt.subplots(1, 2, figsize=(14, 3), sharey=True)
    combined_min = min(p1[feat].min(), p2[feat].min()) 
    combined_max = max(p1[feat].max(), p2[feat].max()) 
    y_margin = 0.05 * (combined_max - combined_min)
    y_min, y_max = combined_min - y_margin, combined_max + y_margin
    sns.boxplot(x="diagnosis", y=feat, data=p1, ax=ax[0], order=['M', 'B'])
    ax[0].set_title(f"{feat} by Label - Baseline Control")
    ax[0].set_ylim(y_min, y_max)
    sns.boxplot(x="diagnosis", y=feat, data=p2, ax=ax[1], order=['M', 'B'])
    ax[1].set_title(f"{feat} by Label - Covariate Drift")
    ax[1].set_ylim(y_min, y_max)
    plt.show() 

    
[Figures: 10 paired boxplots, one per covariate drift feature, showing the feature by label for Baseline Control (left) and Covariate Drift (right)]
In [258]:
##################################
# Detecting univariate drift for covariate drift
##################################
univariate_drift_analysis_p2 = detect_univariate_drift(p1, p2, FEATURE_COLUMNS, "Covariate Drift")
Univariate drift visualization generated for Covariate Drift
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.179        0.150643            None  ...               None   
1              0.321        0.150643            None  ...               None   
2              0.399        0.150643            None  ...               None   
3              0.527        0.150643            None  ...               None   
4              0.562        0.150643            None  ...               None   
5              0.594        0.150643            None  ...               None   
6              0.632        0.150643            None  ...               None   
7              0.660        0.150643            None  ...               None   
8              0.697        0.150643            None  ...               None   
9              0.712        0.150643            None  ...               None   

                texture_se                                         \
        kolmogorov_smirnov                                          
  alert              value upper_threshold lower_threshold  alert   
0  True              0.066         0.09887            None  False   
1  True              0.061         0.09887            None  False   
2  True              0.085         0.09887            None  False   
3  True              0.054         0.09887            None  False   
4  True              0.061         0.09887            None  False   
5  True              0.036         0.09887            None  False   
6  True              0.068         0.09887            None  False   
7  True              0.060         0.09887            None  False   
8  True              0.073         0.09887            None  False   
9  True              0.064         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.097        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.053        0.140661            None  False  
3              0.076        0.140661            None  False  
4              0.064        0.140661            None  False  
5              0.068        0.140661            None  False  
6              0.125        0.140661            None  False  
7              0.060        0.140661            None  False  
8              0.090        0.140661            None  False  
9              0.091        0.140661            None  False  

[10 rows x 127 columns]
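The `value` columns above are two-sample Kolmogorov-Smirnov statistics, and in the printed rows `alert` is `True` whenever the value exceeds `upper_threshold`. As a minimal sketch of what that statistic measures, the example below uses `scipy.stats.ks_2samp` on synthetic data. The arrays, the size of the shift, and the reuse of the printed `area_mean` threshold are illustrative assumptions; NannyML's own computation may differ in detail.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature (illustrative values, not the notebook's data)
reference = rng.normal(loc=650.0, scale=350.0, size=1000)
same_chunk = rng.normal(loc=650.0, scale=350.0, size=100)
shifted_chunk = rng.normal(loc=1000.0, scale=350.0, size=100)

def ks_statistic(ref, chunk):
    """Largest vertical gap between the two empirical CDFs."""
    return ks_2samp(ref, chunk).statistic

# Copied from the area_mean column above, for illustration only
UPPER_THRESHOLD = 0.150643

ks_same = ks_statistic(reference, same_chunk)
ks_shifted = ks_statistic(reference, shifted_chunk)
for name, value in [("no shift", ks_same), ("mean shifted", ks_shifted)]:
    print(f"{name}: KS = {value:.3f}, alert = {value > UPPER_THRESHOLD}")
```

The statistic is bounded between 0 and 1, so values such as 0.712 for `area_mean` in the last chunk indicate that the chunk and reference distributions barely overlap.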
In [259]:
##################################
# Visualizing univariate drift for covariate drift
##################################
univariate_drift_analysis_visualization_p2 = plot_univariate_drift_summary(univariate_drift_analysis_p2, FEATURE_COLUMNS, "Covariate Drift")
[Figure: univariate drift summary for Covariate Drift]
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 10
1 texture_mean 10
2 perimeter_mean 10
3 area_mean 10
4 smoothness_mean 10
5 compactness_mean 10
6 concavity_mean 10
7 concave points_mean 10
8 symmetry_mean 10
9 fractal_dimension_mean 10
10 radius_se 0
11 texture_se 0
12 perimeter_se 0
13 area_se 0
14 smoothness_se 0
15 compactness_se 0
16 concavity_se 0
17 concave points_se 0
18 symmetry_se 0
19 fractal_dimension_se 0
20 radius_worst 0
21 texture_worst 0
22 perimeter_worst 0
23 area_worst 0
24 smoothness_worst 0
25 compactness_worst 0
26 concavity_worst 0
27 concave points_worst 0
28 symmetry_worst 0
29 fractal_dimension_worst 0
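The summary table counts, per feature, how many analysis chunks raised an alert. `plot_univariate_drift_summary` is defined earlier in the notebook, so the following is only a sketch of how such a count could be derived from a frame with the three-level column layout (feature, method, field) visible in the printed drift output. The toy frame and the helper name `count_alerting_chunks` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the (feature, method, field) column layout printed above
columns = pd.MultiIndex.from_tuples([
    ("area_mean", "kolmogorov_smirnov", "value"),
    ("area_mean", "kolmogorov_smirnov", "alert"),
    ("texture_se", "kolmogorov_smirnov", "value"),
    ("texture_se", "kolmogorov_smirnov", "alert"),
])
toy_results = pd.DataFrame(
    [[0.179, True, 0.066, False],
     [0.321, True, 0.061, False],
     [0.399, True, 0.085, False]],
    columns=columns,
)

def count_alerting_chunks(results, features, method="kolmogorov_smirnov"):
    """Count, per feature, the chunks whose alert flag is True (hypothetical helper)."""
    counts = {
        f: int(results[(f, method, "alert")].astype(bool).sum())
        for f in features
    }
    return pd.DataFrame({
        "feature": list(counts),
        "chunk_drift_count": list(counts.values()),
    })

summary = count_alerting_chunks(toy_results, ["area_mean", "texture_se"])
print(summary)
```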
In [260]:
##################################
# Estimating CBPE performance for covariate drift
##################################
chunk_cbpe_performance_analysis_p2 = estimate_chunk_cbpe_performance(p1, p2, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.000000 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.000000 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.000000 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.000000 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.000000 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.000000 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.000000 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.000000 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.000000 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.000000 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.993605 0.00367 NaN 1.000000 0.982594 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.995865 0.00367 NaN 1.000000 0.984854 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.973295 0.00367 NaN 0.984306 0.962283 1 0.976509 True
13 [300:399] 3 300 399 None None analysis 0.980802 0.00367 NaN 0.991813 0.969790 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.974865 0.00367 NaN 0.985877 0.963854 1 0.976509 True
15 [500:599] 5 500 599 None None analysis 0.984459 0.00367 NaN 0.995471 0.973448 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.951432 0.00367 NaN 0.962443 0.940420 1 0.976509 True
17 [700:799] 7 700 799 None None analysis 0.936062 0.00367 NaN 0.947074 0.925051 1 0.976509 True
18 [800:899] 8 800 899 None None analysis 0.842986 0.00367 NaN 0.853997 0.831974 1 0.976509 True
19 [900:999] 9 900 999 None None analysis 0.804781 0.00367 NaN 0.815793 0.793770 1 0.976509 True
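The printed confidence boundaries are consistent with the estimate plus or minus three sampling errors, clipped to the valid ROC AUC range, and the alert flag is consistent with the estimate falling outside the thresholds. Both readings are inferred from the numbers in the table, not from the notebook's code. The sketch below recomputes them from the analysis-period rows under that assumption.

```python
import pandas as pd

# Analysis-period estimates copied from the table above
estimates = pd.DataFrame({
    "chunk_index": list(range(10)),
    "value": [0.993605, 0.995865, 0.973295, 0.980802, 0.974865,
              0.984459, 0.951432, 0.936062, 0.842986, 0.804781],
})
SAMPLING_ERROR = 0.00367    # as printed (rounded)
LOWER_THRESHOLD = 0.976509  # as printed
UPPER_THRESHOLD = 1.0       # as printed

# Assumption: the band is +/- 3 sampling errors, clipped to [0, 1]
estimates["upper_cb"] = (estimates["value"] + 3 * SAMPLING_ERROR).clip(upper=1.0)
estimates["lower_cb"] = (estimates["value"] - 3 * SAMPLING_ERROR).clip(lower=0.0)

# Assumption: an alert fires when the estimate leaves the threshold interval
estimates["alert"] = (
    (estimates["value"] < LOWER_THRESHOLD) | (estimates["value"] > UPPER_THRESHOLD)
)

print(estimates)
```

Chunk 3 illustrates the rule: its estimate of 0.980802 sits above the lower threshold, so no alert is raised even though its lower confidence boundary dips below it.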
In [261]:
##################################
# Visualizing CBPE performance for covariate drift
##################################
chunk_cbpe_performance_analysis_visualization_p2 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p2, baseline_name="Baseline Control", scenario_name="Covariate Drift")
[Figure: estimated CBPE ROC AUC per chunk, Baseline Control vs Covariate Drift]
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 1
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 1
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 1
13 6 reference 0
14 7 analysis 1
15 7 reference 0
16 8 analysis 1
17 8 reference 0
18 9 analysis 1
19 9 reference 0

1.9.3 Simulated Prior Shift

In [262]:
##################################
# Defining the prior-shift parameters
# for the post-model deployment scenario simulation
##################################
PRIOR_SHIFT_START_P = 0.30
PRIOR_SHIFT_END_P = 0.80
PRIOR_SHIFT_RAMP = 5
In [263]:
##################################
# Defining a function for 
# simulating prior shift
##################################
def simulate_P3_prior_shift(df):
    # Initializing a random number generator for reproducibility
    rng = np.random.RandomState(RANDOM_STATE)
    # Separating the dataset into positive (M) and negative (B) subsets
    df_pos = df[df[TARGET_COL].map(LABEL_MAP)==1]
    df_neg = df[df[TARGET_COL].map(LABEL_MAP)==0]
    # Creating an empty list to collect chunked DataFrames
    chunks = []
    # Iterating over each simulated monitoring chunk
    for c in range(N_CHUNKS):
        # Calculating the ramp fraction, which starts at 1/PRIOR_SHIFT_RAMP and is capped at 1
        frac = min(1, (c+1)/PRIOR_SHIFT_RAMP)
        # Gradually changing the class prevalence (probability of positives)
        p = PRIOR_SHIFT_START_P + (PRIOR_SHIFT_END_P - PRIOR_SHIFT_START_P) * frac
        # Determining the number of positive and negative samples in the particular chunk
        n_pos = int(CHUNK_SIZE * p)
        n_neg = CHUNK_SIZE - n_pos
        # Sampling from positive and negative pools with replacement
        pos = df_pos.sample(n=n_pos, replace=True, random_state=rng.randint(0,2**31 -1))
        neg = df_neg.sample(n=n_neg, replace=True, random_state=rng.randint(0,2**31 -1))
        # Combining and shuffling the sampled data to avoid order bias
        chunk = pd.concat([pos, neg]).sample(frac=1, random_state=rng.randint(0,2**31 -1))
        # Assigning synthetic time and chunk identifiers
        chunk['__chunk'] = c
        chunk['__timestamp'] = c
        # Storing the chunk in the list
        chunks.append(chunk)
    # Concatenating all chunks into a single DataFrame for analysis    
    return pd.concat(chunks, ignore_index=True)
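To make the ramp concrete, the sketch below tabulates the target prevalence per chunk using the same formula as `simulate_P3_prior_shift`. The three `PRIOR_SHIFT_*` values are copied from the cell above; `N_CHUNKS = 10` and `CHUNK_SIZE = 100` are assumptions inferred from the 10 chunks of 100 rows visible in the outputs.

```python
# Constants copied from the cells above; N_CHUNKS and CHUNK_SIZE are assumed
# from the 10 chunks of 100 rows visible in the outputs
PRIOR_SHIFT_START_P = 0.30
PRIOR_SHIFT_END_P = 0.80
PRIOR_SHIFT_RAMP = 5
N_CHUNKS = 10
CHUNK_SIZE = 100

def prior_shift_schedule():
    """Return (chunk, target prevalence, n_pos, n_neg) using the same ramp formula."""
    rows = []
    for c in range(N_CHUNKS):
        frac = min(1, (c + 1) / PRIOR_SHIFT_RAMP)
        p = PRIOR_SHIFT_START_P + (PRIOR_SHIFT_END_P - PRIOR_SHIFT_START_P) * frac
        n_pos = int(CHUNK_SIZE * p)
        rows.append((c, round(p, 2), n_pos, CHUNK_SIZE - n_pos))
    return rows

schedule = prior_shift_schedule()
for row in schedule:
    print(row)
```

Because the fraction is computed from `c + 1`, the first chunk already targets a prevalence of 0.40 and the starting value of 0.30 is never used directly; the target reaches 0.80 at chunk 4 and stays there. Note also that `int()` truncates, so a product that lands fractionally below a whole number would yield one fewer positive sample than intended.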
In [264]:
##################################
# Defining a function for 
# plotting class proportion ('M' vs 'B') across chunks
# for both the simulated scenario and baseline control
##################################
def plot_class_proportion(df_base, df_shift, scenario_name):
    def prop(df):
        return df.groupby('__chunk')['diagnosis'].value_counts(normalize=True).unstack().fillna(0)
    base_prop = prop(df_base)
    shift_prop = prop(df_shift)
    fig, ax = plt.subplots(figsize=(14, 3))
    sns.lineplot(data=base_prop['M'], label='Baseline M', ax=ax)
    sns.lineplot(data=shift_prop['M'], label=f'{scenario_name} M', ax=ax)
    ax.set_title(f"Proportion of Malignant (M) per Chunk: {scenario_name} vs Baseline Control")
    ax.set_xlabel("Chunk Index")
    ax.set_ylabel("Proportion of 'M'")
    ax.set_ylim(-0.1, 1)
    ax.set_xticks(range(10))
    ax.legend()
    plt.show()
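The inner `prop` helper relies on a grouped `value_counts(normalize=True)` followed by `unstack()`. A minimal sketch on a toy frame (the eight rows below are invented for illustration) shows the shape of the result that the plotting code indexes with `['M']`.

```python
import pandas as pd

# Toy data: two chunks of four rows each (illustrative only)
toy = pd.DataFrame({
    "__chunk":   [0, 0, 0, 0, 1, 1, 1, 1],
    "diagnosis": ["M", "B", "B", "B", "M", "M", "M", "B"],
})

# Same expression as the inner prop() helper above
proportions = (
    toy.groupby("__chunk")["diagnosis"]
       .value_counts(normalize=True)
       .unstack()
       .fillna(0)
)
print(proportions)
```

The result has one row per chunk and one column per label. `fillna(0)` covers a chunk that lacks one label, but a label absent from every chunk would produce no column at all, so `['M']` would raise a `KeyError` in that case.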
    
In [265]:
##################################
# Simulating post-deployment data drift scenario 3 = prior shift
##################################
p3 = simulate_P3_prior_shift(breast_cancer_monitoring_baseline)
In [266]:
##################################
# Exploring the simulated prior shift
##################################
display(p3)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 B 10.80 21.98 68.79 359.9 0.08801 0.05743 0.036140 0.014040 0.2016 ... 83.69 489.5 0.13030 0.16960 0.19270 0.07485 0.2965 0.07662 0 0
1 M 22.27 19.67 152.80 1509.0 0.13260 0.27680 0.426400 0.182300 0.2556 ... 206.80 2360.0 0.17010 0.69970 0.96080 0.29100 0.4055 0.09789 0 0
2 B 9.72 18.22 60.73 288.1 0.06950 0.02344 0.000000 0.000000 0.1653 ... 62.25 303.8 0.07117 0.02729 0.00000 0.00000 0.1909 0.06559 0 0
3 B 11.51 23.93 74.52 403.5 0.09261 0.10210 0.111200 0.041050 0.1388 ... 82.28 474.2 0.12980 0.25170 0.36300 0.09653 0.2112 0.08732 0 0
4 B 13.50 12.71 85.69 566.2 0.07376 0.03614 0.002758 0.004419 0.1365 ... 95.48 698.7 0.09023 0.05836 0.01379 0.02210 0.2267 0.06192 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 M 17.68 20.74 117.40 963.7 0.11150 0.16650 0.185500 0.105400 0.1971 ... 132.90 1302.0 0.14180 0.34980 0.35830 0.15150 0.2463 0.07738 9 9
996 B 12.96 18.29 84.18 525.2 0.07351 0.07899 0.040570 0.018830 0.1874 ... 96.31 621.9 0.09329 0.23180 0.16040 0.06608 0.3207 0.07247 9 9
997 M 18.66 17.12 121.40 1077.0 0.10540 0.11000 0.145700 0.086650 0.1966 ... 145.40 1549.0 0.15030 0.22910 0.32720 0.16740 0.2894 0.08456 9 9
998 M 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.112700 0.074000 0.1794 ... 153.20 1606.0 0.14420 0.25760 0.37840 0.19320 0.3063 0.08368 9 9
999 M 19.79 25.12 130.40 1192.0 0.10150 0.15890 0.254500 0.114900 0.2202 ... 148.70 1589.0 0.12750 0.38610 0.56730 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns

In [267]:
##################################
# Inspecting class balance stability
# for the simulated prior shift scenario
# and baseline control
##################################
plot_class_proportion(p1, p3, "Prior Shift")
[Figure: proportion of malignant (M) cases per chunk, Prior Shift vs Baseline Control]
In [268]:
##################################
# Visualizing baseline feature variability
# for the simulated prior shift scenario
# and baseline control
##################################
plot_feature_boxplot_comparison(p1, p3, FEATURE_COLUMNS, "Prior Shift")
[Figure: per-feature boxplot comparison, Prior Shift vs Baseline Control]
In [269]:
##################################
# Inspecting feature distributions by class label
# for the simulated prior shift scenario
# and baseline control
##################################
for feat in FEATURE_COLUMNS:
    fig, ax = plt.subplots(1, 2, figsize=(14, 3), sharey=True)
    combined_min = min(p1[feat].min(), p3[feat].min()) 
    combined_max = max(p1[feat].max(), p3[feat].max()) 
    y_margin = 0.05 * (combined_max - combined_min)
    y_min, y_max = combined_min - y_margin, combined_max + y_margin
    sns.boxplot(x="diagnosis", y=feat, data=p1, ax=ax[0], order=['M', 'B'])
    ax[0].set_title(f"{feat} by Label - Baseline Control")
    ax[0].set_ylim(y_min, y_max)
    sns.boxplot(x="diagnosis", y=feat, data=p3, ax=ax[1], order=['M', 'B'])
    ax[1].set_title(f"{feat} by Label - Prior Shift")
    ax[1].set_ylim(y_min, y_max)
    plt.show()
    
[Figures: 30 paired boxplots, one per feature in FEATURE_COLUMNS, showing the feature by label for Baseline Control (left) and Prior Shift (right)]
In [270]:
##################################
# Detecting univariate drift for prior shift
##################################
univariate_drift_analysis_p3 = detect_univariate_drift(p1, p3, FEATURE_COLUMNS, "Prior Shift")
Univariate drift visualization generated for Prior Shift
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.102        0.150643            None  ...               None   
1              0.089        0.150643            None  ...               None   
2              0.168        0.150643            None  ...               None   
3              0.232        0.150643            None  ...               None   
4              0.203        0.150643            None  ...               None   
5              0.243        0.150643            None  ...               None   
6              0.237        0.150643            None  ...               None   
7              0.232        0.150643            None  ...               None   
8              0.289        0.150643            None  ...               None   
9              0.347        0.150643            None  ...               None   

                 texture_se                                         \
         kolmogorov_smirnov                                          
   alert              value upper_threshold lower_threshold  alert   
0  False              0.083         0.09887            None  False   
1  False              0.061         0.09887            None  False   
2  False              0.077         0.09887            None  False   
3  False              0.062         0.09887            None  False   
4   True              0.054         0.09887            None  False   
5   True              0.067         0.09887            None  False   
6  False              0.071         0.09887            None  False   
7   True              0.080         0.09887            None  False   
8   True              0.101         0.09887            None   True   
9   True              0.054         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.121        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.091        0.140661            None  False  
3              0.114        0.140661            None  False  
4              0.174        0.140661            None   True  
5              0.221        0.140661            None   True  
6              0.131        0.140661            None  False  
7              0.130        0.140661            None  False  
8              0.214        0.140661            None   True  
9              0.180        0.140661            None   True  

[10 rows x 127 columns]
In [271]:
##################################
# Visualizing univariate drift for prior shift
##################################
univariate_drift_analysis_visualization_p3 = plot_univariate_drift_summary(univariate_drift_analysis_p3, FEATURE_COLUMNS, "Prior Shift")
[Figure: univariate drift summary for Prior Shift]
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 8
1 texture_mean 5
2 perimeter_mean 8
3 area_mean 8
4 smoothness_mean 5
5 compactness_mean 7
6 concavity_mean 8
7 concave points_mean 8
8 symmetry_mean 3
9 fractal_dimension_mean 0
10 radius_se 7
11 texture_se 1
12 perimeter_se 7
13 area_se 7
14 smoothness_se 0
15 compactness_se 2
16 concavity_se 5
17 concave points_se 6
18 symmetry_se 1
19 fractal_dimension_se 0
20 radius_worst 8
21 texture_worst 4
22 perimeter_worst 8
23 area_worst 8
24 smoothness_worst 7
25 compactness_worst 7
26 concavity_worst 7
27 concave points_worst 7
28 symmetry_worst 1
29 fractal_dimension_worst 2
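Many features alert here even though `simulate_P3_prior_shift` only resamples rows by class and never edits a feature value. One plausible reading is that changing the class mix changes a feature's marginal distribution whenever the two classes differ on that feature. The sketch below illustrates this with a synthetic two-class mixture; the class-conditional means, spreads, and the 0.4 and 0.8 prevalences are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def sample_feature(n, p_malignant):
    """Draw one feature from a two-class mixture with fixed class-conditional distributions."""
    is_malignant = rng.random(n) < p_malignant
    benign_values = rng.normal(12.0, 2.0, size=n)      # illustrative parameters
    malignant_values = rng.normal(17.0, 3.0, size=n)   # illustrative parameters
    return np.where(is_malignant, malignant_values, benign_values)

reference = sample_feature(5000, p_malignant=0.4)
same_mix = sample_feature(5000, p_malignant=0.4)
shifted_mix = sample_feature(5000, p_malignant=0.8)

ks_same = ks_2samp(reference, same_mix).statistic
ks_shifted = ks_2samp(reference, shifted_mix).statistic
print(f"KS with unchanged prevalence: {ks_same:.3f}")
print(f"KS with prevalence 0.4 -> 0.8: {ks_shifted:.3f}")
```

Under this reading, features whose class-conditional distributions overlap heavily would be expected to alert less often, which is at least compatible with the zero counts for `fractal_dimension_mean`, `smoothness_se`, and `fractal_dimension_se` in the table above.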
In [272]:
##################################
# Estimating CBPE performance for prior shift
##################################
chunk_cbpe_performance_analysis_p3 = estimate_chunk_cbpe_performance(p1, p3, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.0 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.0 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.0 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.0 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.0 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.0 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.0 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.0 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.0 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.0 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.993098 0.00367 NaN 1.0 0.982087 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.997225 0.00367 NaN 1.0 0.986213 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.997405 0.00367 NaN 1.0 0.986394 1 0.976509 False
13 [300:399] 3 300 399 None None analysis 0.996471 0.00367 NaN 1.0 0.985460 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.996010 0.00367 NaN 1.0 0.984998 1 0.976509 False
15 [500:599] 5 500 599 None None analysis 0.996252 0.00367 NaN 1.0 0.985240 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.992699 0.00367 NaN 1.0 0.981688 1 0.976509 False
17 [700:799] 7 700 799 None None analysis 0.992849 0.00367 NaN 1.0 0.981838 1 0.976509 False
18 [800:899] 8 800 899 None None analysis 0.992422 0.00367 NaN 1.0 0.981410 1 0.976509 False
19 [900:999] 9 900 999 None None analysis 0.993719 0.00367 NaN 1.0 0.982708 1 0.976509 False
In [273]:
##################################
# Visualizing CBPE performance for prior shift
##################################
chunk_cbpe_performance_analysis_visualization_p3 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p3, baseline_name="Baseline Control", scenario_name="Prior Shift")
[Figure: estimated CBPE ROC AUC per chunk, Baseline Control vs Prior Shift]
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 0
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 0
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 0
13 6 reference 0
14 7 analysis 0
15 7 reference 0
16 8 analysis 0
17 8 reference 0
18 9 analysis 0
19 9 reference 0
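CBPE (confidence-based performance estimation) works from the model's predicted probabilities alone, which is why the `realized` column is `NaN` for analysis chunks. As a rough sketch of the idea, the example below estimates accuracy instead of ROC AUC because the arithmetic is shorter. It assumes perfectly calibrated probabilities and uses synthetic scores; it is not NannyML's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic probabilities of the positive class (illustrative only)
proba = rng.beta(0.5, 0.5, size=20000)
# Labels drawn so that the probabilities are calibrated by construction
y_true = (rng.random(proba.size) < proba).astype(int)
y_pred = (proba >= 0.5).astype(int)

def estimate_accuracy_from_proba(p, predictions):
    """Expected accuracy without labels: a row is correct with probability p
    if predicted positive, and 1 - p if predicted negative."""
    expected_correct = np.where(predictions == 1, p, 1.0 - p)
    return float(expected_correct.mean())

estimated = estimate_accuracy_from_proba(proba, y_pred)
realized = float((y_pred == y_true).mean())
print(f"estimated accuracy: {estimated:.3f}, realized accuracy: {realized:.3f}")
```

The estimate is only as trustworthy as the calibration. When the relationship between features and labels changes, the probabilities stop being calibrated and an estimate of this kind can miss the degradation, so a flat CBPE curve is not by itself proof that realized performance is unchanged.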

1.9.4 Simulated Concept Drift

In [274]:
##################################
# Defining the concept drift-specific parameters
# for the post-model deployment scenario simulation
##################################
CONCEPT_DRIFT_SLICE_FEATURES = ['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
'compactness_se','concavity_se','concave points_se','symmetry_se','fractal_dimension_se',
'radius_worst','perimeter_worst', 'smoothness_worst','concavity_worst','symmetry_worst']
CONCEPT_DRIFT_SLICE_THRESHOLD_QUANTILE = 0.75
CONCEPT_DRIFT_FLIP_FRACTION = 1.0
CONCEPT_DRIFT_RAMP = 10
In [275]:
##################################
# Defining a function for 
# simulating concept drift
##################################
def simulate_P4_concept_drift(df):
    # Initializing a random number generator for reproducibility
    rng = np.random.RandomState(RANDOM_STATE)
    # Creating a time-ordered synthetic stream of data chunks
    stream = make_stream_from_dataframe(df)
    # Iterating through each feature defined to induce localized concept drift
    for feat in CONCEPT_DRIFT_SLICE_FEATURES:
        # Determining a threshold (quantile-based) to define the region affected by concept drift
        thr = df[feat].quantile(CONCEPT_DRIFT_SLICE_THRESHOLD_QUANTILE)
        # Looping through each synthetic chunk (simulated monitoring time)
        for c in range(N_CHUNKS):
            # Computing progression of concept drift (0 → 1) across ramp duration
            frac = min(1.0, (c+1)/CONCEPT_DRIFT_RAMP)
            # Identifying data points within the current chunk and above the feature threshold
            mask = (stream['__chunk']==c) & (stream[feat]>=thr)
            # Extracting indices of samples eligible for label flipping
            idxs = stream[mask].index
            # Computing number of samples to flip based on drift fraction and configured flip rate
            n_flip = int(len(idxs) * CONCEPT_DRIFT_FLIP_FRACTION * frac)
            # Performing label flipping only if there are samples to modify
            if n_flip>0:
                flip = rng.choice(idxs, n_flip, replace=False)
                # Swapping labels: 'B' becomes 'M', and 'M' becomes 'B'
                # (a row above the threshold on several slice features can be flipped more than once)
                stream.loc[flip, TARGET_COL] = stream.loc[flip, TARGET_COL].map({'B':'M','M':'B'})
    # Returning the modified data stream containing simulated concept drift
    return stream
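Because the outer loop visits each slice feature in turn, a row that exceeds the threshold on several features is eligible for a flip once per feature, and an even number of flips returns it to its original label. The toy sketch below demonstrates this with two hypothetical features (`f1`, `f2`) and a fixed threshold standing in for the per-feature quantile; every eligible row is flipped, as happens when the flip fraction and ramp fraction are both 1.0.

```python
import pandas as pd

# Toy stream: features f1/f2 and their values are hypothetical
stream = pd.DataFrame({
    "f1": [1.0, 9.0, 9.0, 1.0],
    "f2": [1.0, 1.0, 9.0, 9.0],
    "diagnosis": ["B", "B", "B", "B"],
})
original = stream["diagnosis"].copy()
flip_counts = pd.Series(0, index=stream.index)

for feat in ["f1", "f2"]:
    thr = 5.0  # fixed threshold standing in for the per-feature quantile
    idxs = stream[stream[feat] >= thr].index
    # Flip every eligible row for this feature
    stream.loc[idxs, "diagnosis"] = stream.loc[idxs, "diagnosis"].map({"B": "M", "M": "B"})
    flip_counts.loc[idxs] += 1

stream["times_flipped"] = flip_counts
stream["label_changed"] = stream["diagnosis"] != original
print(stream)
```

The third row is above the threshold on both features, is flipped twice, and ends with its original label. With 15 slice features in `CONCEPT_DRIFT_SLICE_FEATURES`, the net number of changed labels per chunk may therefore be lower than the per-feature flip counts suggest.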
In [276]:
##################################
# Simulating post-deployment data drift scenario 4 = concept drift
##################################
p4 = simulate_P4_concept_drift(breast_cancer_monitoring_baseline)
In [277]:
##################################
# Exploring the simulated concept drift
##################################
display(p4)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 B 17.420 25.56 114.50 948.0 0.10060 0.11460 0.168200 0.065970 0.1308 ... 120.40 1021.0 0.1243 0.17930 0.280300 0.10990 0.1603 0.06818 0 0
1 B 22.270 19.67 152.80 1509.0 0.13260 0.27680 0.426400 0.182300 0.2556 ... 206.80 2360.0 0.1701 0.69970 0.960800 0.29100 0.4055 0.09789 0 0
2 B 11.250 14.78 71.38 390.0 0.08306 0.04458 0.000974 0.002941 0.1773 ... 82.08 492.7 0.1166 0.09794 0.005518 0.01667 0.2815 0.07418 0 0
3 B 12.250 22.44 78.18 466.5 0.08192 0.05200 0.017140 0.012610 0.1544 ... 92.74 622.9 0.1256 0.18040 0.123000 0.06335 0.3100 0.08203 0 0
4 B 10.480 19.86 66.72 337.7 0.10700 0.05971 0.048310 0.030700 0.1737 ... 73.68 402.8 0.1515 0.10260 0.118100 0.06736 0.2883 0.07748 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 B 14.030 21.25 89.79 603.4 0.09070 0.06945 0.014620 0.018960 0.1517 ... 98.27 715.5 0.1287 0.15130 0.062310 0.07963 0.2226 0.07617 9 9
996 B 13.710 18.68 88.73 571.0 0.09916 0.10700 0.053850 0.037830 0.1714 ... 99.43 701.9 0.1425 0.25660 0.193500 0.12840 0.2849 0.09031 9 9
997 B 13.080 15.71 85.63 520.0 0.10750 0.12700 0.045680 0.031100 0.1967 ... 96.09 630.5 0.1312 0.27760 0.189000 0.07283 0.3184 0.08183 9 9
998 M 8.597 18.60 54.09 221.2 0.10740 0.05847 0.000000 0.000000 0.2163 ... 56.65 240.1 0.1347 0.07767 0.000000 0.00000 0.3142 0.08116 9 9
999 B 19.790 25.12 130.40 1192.0 0.10150 0.15890 0.254500 0.114900 0.2202 ... 148.70 1589.0 0.1275 0.38610 0.567300 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns

In [278]:
##################################
# Inspecting class balance stability
# for the simulated concept drift scenario
# and baseline control
##################################
plot_class_proportion(p1, p4, "Concept Drift")
[Figure: class proportion comparison, Baseline Control vs. Concept Drift]
In [279]:
##################################
# Comparing feature variability
# between the simulated concept drift scenario
# and the baseline control
##################################
plot_feature_boxplot_comparison(p1, p4, CONCEPT_DRIFT_SLICE_FEATURES, "Concept Drift")
[Figure: feature boxplot comparison for CONCEPT_DRIFT_SLICE_FEATURES, Baseline Control vs. Concept Drift]
In [280]:
##################################
# Inspecting feature distributions by class label
# for the simulated concept drift scenario
# and the baseline control
##################################
for feat in CONCEPT_DRIFT_SLICE_FEATURES:
    fig, ax = plt.subplots(1, 2, figsize=(14, 3), sharey=True)
    combined_min = min(p1[feat].min(), p4[feat].min()) 
    combined_max = max(p1[feat].max(), p4[feat].max()) 
    y_margin = 0.05 * (combined_max - combined_min)
    y_min, y_max = combined_min - y_margin, combined_max + y_margin
    sns.boxplot(x="diagnosis", y=feat, data=p1, ax=ax[0], order=['M', 'B'])
    ax[0].set_title(f"{feat} by Label - Baseline Control")
    ax[0].set_ylim(y_min, y_max)
    sns.boxplot(x="diagnosis", y=feat, data=p4, ax=ax[1], order=['M', 'B'])
    ax[1].set_title(f"{feat} by Label - Concept Drift")
    ax[1].set_ylim(y_min, y_max)
    plt.show()
    
[Figures: 15 paired boxplots, one per feature in CONCEPT_DRIFT_SLICE_FEATURES, titled "{feat} by Label - Baseline Control" and "{feat} by Label - Concept Drift"]
In [281]:
##################################
# Detecting univariate drift for concept drift
##################################
univariate_drift_analysis_p4 = detect_univariate_drift(p1, p4, FEATURE_COLUMNS, "Concept Drift")
Univariate drift visualization generated for Concept Drift
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.040        0.150643            None  ...               None   
1              0.089        0.150643            None  ...               None   
2              0.068        0.150643            None  ...               None   
3              0.117        0.150643            None  ...               None   
4              0.113        0.150643            None  ...               None   
5              0.055        0.150643            None  ...               None   
6              0.061        0.150643            None  ...               None   
7              0.060        0.150643            None  ...               None   
8              0.045        0.150643            None  ...               None   
9              0.089        0.150643            None  ...               None   

                 texture_se                                         \
         kolmogorov_smirnov                                          
   alert              value upper_threshold lower_threshold  alert   
0  False              0.066         0.09887            None  False   
1  False              0.061         0.09887            None  False   
2  False              0.085         0.09887            None  False   
3  False              0.054         0.09887            None  False   
4  False              0.061         0.09887            None  False   
5  False              0.036         0.09887            None  False   
6  False              0.068         0.09887            None  False   
7  False              0.060         0.09887            None  False   
8  False              0.073         0.09887            None  False   
9  False              0.064         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.097        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.053        0.140661            None  False  
3              0.076        0.140661            None  False  
4              0.064        0.140661            None  False  
5              0.068        0.140661            None  False  
6              0.125        0.140661            None  False  
7              0.060        0.140661            None  False  
8              0.090        0.140661            None  False  
9              0.091        0.140661            None  False  

[10 rows x 127 columns]
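
The `kolmogorov_smirnov` values in the table above are two-sample KS statistics between the reference data and each analysis chunk. The sketch below shows how such a statistic can be computed by hand with NumPy. It is an illustration only: the function name `ks_statistic` is hypothetical, dropping NaNs before the comparison is an assumption of this sketch, and the way NannyML derives its alert thresholds is not reproduced here.

```python
import numpy as np

def ks_statistic(reference, analysis):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    ref = np.sort(np.asarray(reference, dtype=float))
    ana = np.sort(np.asarray(analysis, dtype=float))
    # Assumption of this sketch: missing values are dropped before comparing
    ref = ref[~np.isnan(ref)]
    ana = ana[~np.isnan(ana)]
    # Evaluating both empirical CDFs at every observed value
    grid = np.concatenate([ref, ana])
    cdf_ref = np.searchsorted(ref, grid, side="right") / ref.size
    cdf_ana = np.searchsorted(ana, grid, side="right") / ana.size
    return float(np.max(np.abs(cdf_ref - cdf_ana)))

# Hypothetical data: a reference sample and a chunk shifted by half a standard deviation
rng = np.random.RandomState(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)
chunk_same = rng.normal(loc=0.0, scale=1.0, size=100)
chunk_shifted = rng.normal(loc=0.5, scale=1.0, size=100)

print("unshifted chunk:", ks_statistic(reference, chunk_same))
print("shifted chunk:  ", ks_statistic(reference, chunk_shifted))
```

Because the statistic depends only on feature values, swapping labels (as in the concept drift scenario) cannot change it.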
In [282]:
##################################
# Visualizing univariate drift for concept drift
##################################
univariate_drift_analysis_visualization_p4 = plot_univariate_drift_summary(univariate_drift_analysis_p4, FEATURE_COLUMNS, "Concept Drift")
[Figure: univariate drift summary for Concept Drift]
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 0
1 texture_mean 0
2 perimeter_mean 0
3 area_mean 0
4 smoothness_mean 0
5 compactness_mean 0
6 concavity_mean 0
7 concave points_mean 0
8 symmetry_mean 0
9 fractal_dimension_mean 0
10 radius_se 0
11 texture_se 0
12 perimeter_se 0
13 area_se 0
14 smoothness_se 0
15 compactness_se 0
16 concavity_se 0
17 concave points_se 0
18 symmetry_se 0
19 fractal_dimension_se 0
20 radius_worst 0
21 texture_worst 0
22 perimeter_worst 0
23 area_worst 0
24 smoothness_worst 0
25 compactness_worst 0
26 concavity_worst 0
27 concave points_worst 0
28 symmetry_worst 0
29 fractal_dimension_worst 0
In [283]:
##################################
# Estimating CBPE performance for concept drift
##################################
chunk_cbpe_performance_analysis_p4 = estimate_chunk_cbpe_performance(p1, p4, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.0 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.0 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.0 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.0 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.0 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.0 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.0 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.0 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.0 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.0 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.994438 0.00367 NaN 1.0 0.983427 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.997225 0.00367 NaN 1.0 0.986213 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.997486 0.00367 NaN 1.0 0.986475 1 0.976509 False
13 [300:399] 3 300 399 None None analysis 0.995054 0.00367 NaN 1.0 0.984043 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.996828 0.00367 NaN 1.0 0.985817 1 0.976509 False
15 [500:599] 5 500 599 None None analysis 0.995083 0.00367 NaN 1.0 0.984072 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.996123 0.00367 NaN 1.0 0.985111 1 0.976509 False
17 [700:799] 7 700 799 None None analysis 0.995848 0.00367 NaN 1.0 0.984836 1 0.976509 False
18 [800:899] 8 800 899 None None analysis 0.991864 0.00367 NaN 1.0 0.980853 1 0.976509 False
19 [900:999] 9 900 999 None None analysis 0.994634 0.00367 NaN 1.0 0.983623 1 0.976509 False
In [284]:
##################################
# Visualizing CBPE performance for concept drift
##################################
chunk_cbpe_performance_analysis_visualization_p4 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p4, baseline_name="Baseline Control", scenario_name="Concept Drift")
[Figure: chunk-level CBPE ROC AUC, Baseline Control vs. Concept Drift]
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 0
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 0
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 0
13 6 reference 0
14 7 analysis 0
15 7 reference 0
16 8 analysis 0
17 8 reference 0
18 9 analysis 0
19 9 reference 0
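
The analysis-period estimates above are identical to the reference-period estimates, and no alerts are raised. This is consistent with how confidence-based estimation works: it uses only the predicted probabilities, and label flipping leaves the features (and therefore the probabilities) unchanged. The sketch below illustrates the idea on synthetic scores. It is a simplified stand-in, not NannyML's implementation: it uses accuracy instead of the ROC AUC reported in the notebook, and the function names and simulated data are hypothetical.

```python
import numpy as np

def cbpe_style_accuracy(proba, threshold=0.5):
    """Expected accuracy computed from calibrated probabilities alone (no labels)."""
    proba = np.asarray(proba, dtype=float)
    pred_positive = proba >= threshold
    # Under calibration, a positive prediction is correct with probability p,
    # and a negative prediction is correct with probability 1 - p
    p_correct = np.where(pred_positive, proba, 1.0 - proba)
    return float(p_correct.mean())

def realized_accuracy(proba, y_true, threshold=0.5):
    """Accuracy measured against ground-truth labels."""
    pred = (np.asarray(proba) >= threshold).astype(int)
    return float((pred == np.asarray(y_true)).mean())

rng = np.random.RandomState(42)
# Hypothetical calibrated scores and labels drawn consistently with them
proba = rng.beta(0.3, 0.3, size=2000)
y = (rng.uniform(size=proba.size) < proba).astype(int)

# Simulating concept drift by flipping 25% of the labels; the scores stay untouched
flip = rng.choice(proba.size, size=proba.size // 4, replace=False)
y_flipped = y.copy()
y_flipped[flip] = 1 - y_flipped[flip]

estimated = cbpe_style_accuracy(proba)
realized_clean = realized_accuracy(proba, y)
realized_flipped = realized_accuracy(proba, y_flipped)
print("estimated (labels not used):", estimated)
print("realized before flipping:   ", realized_clean)
print("realized after flipping:    ", realized_flipped)
```

The estimate tracks the realized value only while the relationship between features and labels holds, so detecting this kind of drift requires ground-truth labels and realized performance.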

1.9.5 Simulated Missingness Spike¶

In [285]:
##################################
# Defining the missingness spike-specific parameters
# for the post-model deployment scenario simulation
##################################
MCAR_FEATURES = ['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean', 'compactness_mean','concavity_mean']
MAR_FEATURES = ['compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean']
MISSINGNESS_SPIKE_FEATURES = list(dict.fromkeys(MCAR_FEATURES + MAR_FEATURES))
MISSINGNESS_SPIKE_INTENSITY = 0.8
MISSINGNESS_SPIKE_LENGTH = 6
MISSINGNESS_PROLONGED_INCREASE = 0.50
MISSINGNESS_PROLONGED_LENGTH = 5
In [286]:
##################################
# Defining a function for 
# simulating missingness spike
##################################
def simulate_P5_missingness_spike(df):
    # Initializing RNG for reproducibility
    rng = np.random.RandomState(RANDOM_STATE)

    # Creating time-ordered synthetic stream of data chunks
    stream = make_stream_from_dataframe(df)

    # Defining MCAR spike window
    spike_start, spike_end = N_CHUNKS // 3, N_CHUNKS // 3 + MISSINGNESS_SPIKE_LENGTH

    # Simulating MCAR (Missing Completely At Random)
    for c in range(spike_start, spike_end):
        # Identifying rows belonging to the current chunk
        mask = stream['__chunk'] == c

        for f in MCAR_FEATURES:
            # Skipping if feature not present in data
            if f not in stream.columns:
                continue

            # Indices of rows in this chunk
            idx = stream[mask].index

            # Randomly selecting a fraction of rows to make missing
            n_missing = int(len(idx) * MISSINGNESS_SPIKE_INTENSITY)
            if n_missing == 0:
                continue

            miss = rng.choice(idx, n_missing, replace=False)

            # Applying missingness
            stream.loc[miss, f] = np.nan

    # Simulating MAR (Missing At Random) based on a reference feature
    for c in range(N_CHUNKS):
        mask = stream['__chunk'] == c

        # Proceeding only if the predictor feature exists
        if 'area_mean' not in stream.columns:
            continue

        # Identifying high values of 'area_mean' (top 20%)
        high_area = stream.loc[mask & (stream['area_mean'] > stream['area_mean'].quantile(0.8))].index
        if len(high_area) == 0:
            continue

        # Applying MAR missingness to multiple MAR features
        for f in MAR_FEATURES:
            if f not in stream.columns:
                continue

            # Masking 20% of this chunk's high-area rows for each MAR feature
            n_mar = int(len(high_area) * 0.2)
            if n_mar == 0:
                continue

            miss = rng.choice(high_area, n_mar, replace=False)
            stream.loc[miss, f] = np.nan

    # Simulating Prolonged missingness pattern after spikes 
    for c in range(spike_end, spike_end + MISSINGNESS_PROLONGED_LENGTH):
        mask = stream['__chunk'] == c
        for f in MCAR_FEATURES:
            if f not in stream.columns:
                continue
            idx = stream[mask].index
            n_missing = int(len(idx) * MISSINGNESS_PROLONGED_INCREASE)
            if n_missing == 0:
                continue
            miss = rng.choice(idx, n_missing, replace=False)
            stream.loc[miss, f] = np.nan

    # Returning the modified stream with simulated missingness
    return stream
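
The chunk windows implied by these parameters can be worked out directly. The sketch below assumes `N_CHUNKS = 10`, which matches the ten chunks shown in the outputs but is defined outside this section; the helper name `missingness_windows` is hypothetical.

```python
N_CHUNKS = 10  # assumption: matches the ten chunks (0 to 9) shown in the outputs
MISSINGNESS_SPIKE_LENGTH = 6
MISSINGNESS_PROLONGED_LENGTH = 5

def missingness_windows(n_chunks, spike_length, prolonged_length):
    """Return the chunk indices actually affected by the spike and prolonged phases."""
    spike_start = n_chunks // 3
    spike_end = spike_start + spike_length
    # Chunk indices beyond the end of the stream match no rows, so they are dropped here
    spike = [c for c in range(spike_start, spike_end) if c < n_chunks]
    prolonged = [c for c in range(spike_end, spike_end + prolonged_length) if c < n_chunks]
    return spike, prolonged

spike_chunks, prolonged_chunks = missingness_windows(
    N_CHUNKS, MISSINGNESS_SPIKE_LENGTH, MISSINGNESS_PROLONGED_LENGTH
)
print(spike_chunks)      # [3, 4, 5, 6, 7, 8]
print(prolonged_chunks)  # [9]
```

Under that assumption, the prolonged phase nominally spans chunks 9 to 13, but only chunk 9 exists in the stream, so the remaining iterations of that loop select no rows.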
In [287]:
##################################
# Defining a function for 
# plotting missing fraction per chunk
# for both the simulated and baseline control
##################################
def plot_missingness(df_base, df_missing, features, scenario_name):
    # Computing the missing fraction per chunk
    def missing_rate(df):
        return df.groupby('__chunk')[features].apply(lambda x: x.isna().mean())

    # Computing missingness for baseline and simulated datasets
    miss_base = missing_rate(df_base)
    miss_sim = missing_rate(df_missing)

    # Creating a subplot per feature
    n_features = len(features)
    fig, axes = plt.subplots(n_features, 1, figsize=(12, 3 * n_features), sharex=True)
    if n_features == 1:
        axes = [axes]

    # Looping through features and plotting both the baseline and the scenario
    for ax, f in zip(axes, features):
        # Plotting baseline missingness
        sns.lineplot(x=miss_base.index, y=miss_base[f], color="#4C72B0", label="Baseline", ax=ax)
        # Plotting simulated scenario missingness
        sns.lineplot(x=miss_sim.index, y=miss_sim[f], color="#DD8452", label=scenario_name, ax=ax)

        ax.set_title(f"Missingness over Time: {f} ({scenario_name} vs Baseline Control)", fontsize=11)
        ax.set_xlabel("Chunk Index")
        ax.set_ylabel("Missing Fraction")
        ax.set_ylim(-0.1, 1)
        ax.set_xticks(range(10))
        ax.grid(True, alpha=0.3)
        ax.legend(loc="best")

    plt.tight_layout()
    plt.show()
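
A tabular counterpart to the plot can be useful when exact missing fractions are needed. The sketch below computes the same per-chunk missing fraction as a DataFrame on a toy stream. The helper name `missing_fraction_table` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_fraction_table(df, features, chunk_col="__chunk"):
    """Per-chunk fraction of missing values for each feature."""
    return df[features].isna().groupby(df[chunk_col]).mean()

# Toy stream with two chunks of two rows each (hypothetical data)
toy = pd.DataFrame({
    "__chunk": [0, 0, 1, 1],
    "radius_mean": [1.0, 2.0, np.nan, 4.0],
    "area_mean": [10.0, np.nan, np.nan, np.nan],
})

table = missing_fraction_table(toy, ["radius_mean", "area_mean"])
print(table)
```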

    
In [288]:
##################################
# Simulating post-deployment data drift scenario 5 = missingness spike
##################################
p5 = simulate_P5_missingness_spike(breast_cancer_monitoring_baseline)
In [289]:
##################################
# Exploring the simulated missingness spike
##################################
display(p5)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 M 17.420 25.56 114.50 948.0 0.10060 0.11460 0.168200 0.065970 0.1308 ... 120.40 1021.0 0.1243 0.17930 0.280300 0.10990 0.1603 0.06818 0 0
1 M 22.270 19.67 152.80 1509.0 0.13260 NaN 0.426400 NaN 0.2556 ... 206.80 2360.0 0.1701 0.69970 0.960800 0.29100 0.4055 0.09789 0 0
2 B 11.250 14.78 71.38 390.0 0.08306 0.04458 0.000974 0.002941 0.1773 ... 82.08 492.7 0.1166 0.09794 0.005518 0.01667 0.2815 0.07418 0 0
3 B 12.250 22.44 78.18 466.5 0.08192 0.05200 0.017140 0.012610 0.1544 ... 92.74 622.9 0.1256 0.18040 0.123000 0.06335 0.3100 0.08203 0 0
4 B 10.480 19.86 66.72 337.7 0.10700 0.05971 0.048310 0.030700 0.1737 ... 73.68 402.8 0.1515 0.10260 0.118100 0.06736 0.2883 0.07748 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 B 14.030 21.25 NaN NaN NaN 0.06945 NaN 0.018960 0.1517 ... 98.27 715.5 0.1287 0.15130 0.062310 0.07963 0.2226 0.07617 9 9
996 B NaN NaN NaN 571.0 0.09916 NaN 0.053850 0.037830 0.1714 ... 99.43 701.9 0.1425 0.25660 0.193500 0.12840 0.2849 0.09031 9 9
997 B 13.080 15.71 85.63 NaN NaN NaN 0.045680 0.031100 0.1967 ... 96.09 630.5 0.1312 0.27760 0.189000 0.07283 0.3184 0.08183 9 9
998 B 8.597 18.60 NaN NaN 0.10740 NaN 0.000000 0.000000 0.2163 ... 56.65 240.1 0.1347 0.07767 0.000000 0.00000 0.3142 0.08116 9 9
999 M 19.790 NaN 130.40 1192.0 0.10150 NaN NaN 0.114900 NaN ... 148.70 1589.0 0.1275 0.38610 0.567300 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns

In [290]:
##################################
# Evaluating per-chunk missingness
# for the simulated missingness spike scenario
# and the baseline control
##################################
plot_missingness(p1, p5, MISSINGNESS_SPIKE_FEATURES, "Missingness Spike")
[Figure: missing fraction per chunk for each feature in MISSINGNESS_SPIKE_FEATURES, Missingness Spike vs. Baseline Control]
In [291]:
##################################
# Comparing feature variability
# between the simulated missingness spike scenario
# and the baseline control
##################################
plot_feature_boxplot_comparison(p1, p5, MISSINGNESS_SPIKE_FEATURES, "Missingness Spike") 
[Figure: feature boxplot comparison for MISSINGNESS_SPIKE_FEATURES, Baseline Control vs. Missingness Spike]
In [292]:
##################################
# Inspecting feature distributions by class label
# for the simulated missingness spike scenario
# and the baseline control
##################################
for feat in MISSINGNESS_SPIKE_FEATURES:
    fig, ax = plt.subplots(1, 2, figsize=(14, 3), sharey=True)
    combined_min = min(p1[feat].min(), p5[feat].min()) 
    combined_max = max(p1[feat].max(), p5[feat].max()) 
    y_margin = 0.05 * (combined_max - combined_min)
    y_min, y_max = combined_min - y_margin, combined_max + y_margin
    sns.boxplot(x="diagnosis", y=feat, data=p1, ax=ax[0], order=['M', 'B'])
    ax[0].set_title(f"{feat} by Label - Baseline Control")
    ax[0].set_ylim(y_min, y_max)
    sns.boxplot(x="diagnosis", y=feat, data=p5, ax=ax[1], order=['M', 'B'])
    ax[1].set_title(f"{feat} by Label - Missingness Spike")
    ax[1].set_ylim(y_min, y_max)
    plt.show() 
    
[Figures: 10 paired boxplots, one per feature in MISSINGNESS_SPIKE_FEATURES, titled "{feat} by Label - Baseline Control" and "{feat} by Label - Missingness Spike"]
In [293]:
##################################
# Detecting univariate drift for missingness spike
##################################
univariate_drift_analysis_p5 = detect_univariate_drift(p1, p5, FEATURE_COLUMNS, "Missingness Spike")
Univariate drift visualization generated for Missingness Spike
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.040        0.150643            None  ...               None   
1              0.089        0.150643            None  ...               None   
2              0.068        0.150643            None  ...               None   
3              0.130        0.150643            None  ...               None   
4              0.173        0.150643            None  ...               None   
5              0.121        0.150643            None  ...               None   
6              0.218        0.150643            None  ...               None   
7              0.175        0.150643            None  ...               None   
8              0.115        0.150643            None  ...               None   
9              0.099        0.150643            None  ...               None   

                 texture_se                                         \
         kolmogorov_smirnov                                          
   alert              value upper_threshold lower_threshold  alert   
0  False              0.066         0.09887            None  False   
1  False              0.061         0.09887            None  False   
2  False              0.085         0.09887            None  False   
3   True              0.054         0.09887            None  False   
4   True              0.061         0.09887            None  False   
5  False              0.036         0.09887            None  False   
6   True              0.068         0.09887            None  False   
7  False              0.060         0.09887            None  False   
8   True              0.073         0.09887            None  False   
9  False              0.064         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.097        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.053        0.140661            None  False  
3              0.076        0.140661            None  False  
4              0.064        0.140661            None  False  
5              0.068        0.140661            None  False  
6              0.125        0.140661            None  False  
7              0.060        0.140661            None  False  
8              0.090        0.140661            None  False  
9              0.091        0.140661            None  False  

[10 rows x 127 columns]
In [294]:
##################################
# Visualizing univariate drift for missingness spike
##################################
univariate_drift_analysis_visualization_p5 = plot_univariate_drift_summary(univariate_drift_analysis_p5, FEATURE_COLUMNS, "Missingness Spike")
[Figure: univariate drift summary for Missingness Spike]
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 6
1 texture_mean 4
2 perimeter_mean 5
3 area_mean 3
4 smoothness_mean 4
5 compactness_mean 3
6 concavity_mean 6
7 concave points_mean 0
8 symmetry_mean 0
9 fractal_dimension_mean 0
10 radius_se 0
11 texture_se 0
12 perimeter_se 0
13 area_se 0
14 smoothness_se 0
15 compactness_se 0
16 concavity_se 0
17 concave points_se 0
18 symmetry_se 0
19 fractal_dimension_se 0
20 radius_worst 0
21 texture_worst 0
22 perimeter_worst 0
23 area_worst 0
24 smoothness_worst 0
25 compactness_worst 0
26 concavity_worst 0
27 concave points_worst 0
28 symmetry_worst 0
29 fractal_dimension_worst 0
In [295]:
##################################
# Estimating CBPE performance for missingness spike
##################################
chunk_cbpe_performance_analysis_p5 = estimate_chunk_cbpe_performance(p1, p5, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.000000 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.000000 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.000000 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.000000 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.000000 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.000000 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.000000 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.000000 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.000000 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.000000 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.994438 0.00367 NaN 1.000000 0.983427 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.997225 0.00367 NaN 1.000000 0.986213 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.997486 0.00367 NaN 1.000000 0.986475 1 0.976509 False
13 [300:399] 3 300 399 None None analysis 0.991720 0.00367 NaN 1.000000 0.980708 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.995417 0.00367 NaN 1.000000 0.984406 1 0.976509 False
15 [500:599] 5 500 599 None None analysis 0.993748 0.00367 NaN 1.000000 0.982737 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.988016 0.00367 NaN 0.999027 0.977004 1 0.976509 False
17 [700:799] 7 700 799 None None analysis 0.994608 0.00367 NaN 1.000000 0.983596 1 0.976509 False
18 [800:899] 8 800 899 None None analysis 0.988792 0.00367 NaN 0.999803 0.977780 1 0.976509 False
19 [900:999] 9 900 999 None None analysis 0.992945 0.00367 NaN 1.000000 0.981934 1 0.976509 False
In [296]:
##################################
# Visualizing CBPE performance for missingness spike
##################################
chunk_cbpe_performance_analysis_visualization_p5 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p5, baseline_name="Baseline Control", scenario_name="Missingness Spike")
[Figure: chunk-level CBPE ROC AUC, Baseline Control vs. Missingness Spike]
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 0
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 0
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 0
13 6 reference 0
14 7 analysis 0
15 7 reference 0
16 8 analysis 0
17 8 reference 0
18 9 analysis 0
19 9 reference 0

1.9.6 Simulated Seasonal Pattern¶

In [297]:
##################################
# Defining the seasonal pattern-specific parameters
# for the post-model deployment scenario simulation
##################################
SEASONAL_PATTERN_FEATURES = ['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
'compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean']
SEASONAL_AMPLITUDE_SIGMAS = 0.5
SEASONAL_PERIOD = 10
In [298]:
##################################
# Defining a function for 
# simulating seasonal pattern
##################################
def simulate_P6_seasonal(df):
    # Creating a time-ordered synthetic stream of data chunks
    stream = make_stream_from_dataframe(df)
    # Computing standard deviations of seasonal features (used to scale amplitude)
    stds = df[SEASONAL_PATTERN_FEATURES].std()
    # Looping through each chunk (simulated time window)
    for c in range(N_CHUNKS):
        # Identifying the subset of rows belonging to the current chunk
        mask = stream['__chunk']==c
        # Applying sinusoidal seasonal pattern to each selected feature
        for f in SEASONAL_PATTERN_FEATURES:
            # Defining the amplitude of the seasonal signal (A = SEASONAL_AMPLITUDE_SIGMAS × feature std)
            amp = SEASONAL_AMPLITUDE_SIGMAS * stds[f]
            # Applying sinusoidal variation based on the chunk index (acting as a proxy for time)
            stream.loc[mask, f] += amp * np.sin(2 * np.pi * c / SEASONAL_PERIOD)
    # Returning the modified data stream with simulated seasonality
    return stream
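
The additive shift applied to each chunk is `amp * sin(2 * pi * c / SEASONAL_PERIOD)`, so it can be tabulated in units of each feature's standard deviation. The sketch below does this under the assumption that `N_CHUNKS = 10` (defined outside this section, consistent with chunks 0 to 9 in the outputs); the helper name `seasonal_offset_in_sigmas` is hypothetical.

```python
import math

SEASONAL_AMPLITUDE_SIGMAS = 0.5
SEASONAL_PERIOD = 10
N_CHUNKS = 10  # assumption: matches the ten chunks shown in the outputs

def seasonal_offset_in_sigmas(chunk, amplitude=SEASONAL_AMPLITUDE_SIGMAS, period=SEASONAL_PERIOD):
    """Seasonal shift for a chunk, expressed as a multiple of the feature's std."""
    return amplitude * math.sin(2 * math.pi * chunk / period)

offsets = [seasonal_offset_in_sigmas(c) for c in range(N_CHUNKS)]
for c, off in enumerate(offsets):
    print(f"chunk {c}: {off:+.4f} sigma")
```

With a period equal to the number of chunks, the stream covers one full cycle: chunk 0 is unshifted, the first half of the cycle shifts features upward, and the second half shifts them downward. Because the shift is additive, features that are non-negative in the baseline can become slightly negative in the downward phase.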
In [299]:
##################################
# Simulating post-deployment data drift scenario 6 = seasonal pattern
##################################
p6 = simulate_P6_seasonal(breast_cancer_monitoring_baseline)
In [300]:
##################################
# Exploring the simulated seasonal pattern
##################################
display(p6)
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst __chunk __timestamp
0 M 17.420000 25.560000 114.500000 948.000000 0.100600 0.114600 0.168200 0.065970 0.130800 ... 120.40 1021.0 0.1243 0.17930 0.280300 0.10990 0.1603 0.06818 0 0
1 M 22.270000 19.670000 152.800000 1509.000000 0.132600 0.276800 0.426400 0.182300 0.255600 ... 206.80 2360.0 0.1701 0.69970 0.960800 0.29100 0.4055 0.09789 0 0
2 B 11.250000 14.780000 71.380000 390.000000 0.083060 0.044580 0.000974 0.002941 0.177300 ... 82.08 492.7 0.1166 0.09794 0.005518 0.01667 0.2815 0.07418 0 0
3 B 12.250000 22.440000 78.180000 466.500000 0.081920 0.052000 0.017140 0.012610 0.154400 ... 92.74 622.9 0.1256 0.18040 0.123000 0.06335 0.3100 0.08203 0 0
4 B 10.480000 19.860000 66.720000 337.700000 0.107000 0.059710 0.048310 0.030700 0.173700 ... 73.68 402.8 0.1515 0.10260 0.118100 0.06736 0.2883 0.07748 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 B 13.001295 20.070771 82.714167 501.412888 0.086328 0.052944 -0.009284 0.007443 0.143055 ... 98.27 715.5 0.1287 0.15130 0.062310 0.07963 0.2226 0.07617 9 9
996 B 12.681295 17.500771 81.654167 469.012888 0.094788 0.090494 0.029946 0.026313 0.162755 ... 99.43 701.9 0.1425 0.25660 0.193500 0.12840 0.2849 0.09031 9 9
997 B 12.051295 14.530771 78.554167 418.012888 0.103128 0.110494 0.021776 0.019583 0.188055 ... 96.09 630.5 0.1312 0.27760 0.189000 0.07283 0.3184 0.08183 9 9
998 B 7.568295 17.420771 47.014167 119.212888 0.103028 0.041964 -0.023904 -0.011517 0.207655 ... 56.65 240.1 0.1347 0.07767 0.000000 0.00000 0.3142 0.08116 9 9
999 M 18.761295 23.940771 123.324167 1090.012888 0.097128 0.142394 0.230596 0.103383 0.211555 ... 148.70 1589.0 0.1275 0.38610 0.567300 0.17320 0.3305 0.08465 9 9

1000 rows × 33 columns
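
The definition of simulate_P6_seasonal is not shown in this part of the notebook. As a rough illustration only, a seasonal perturbation of this kind could be sketched as a sinusoidal offset that depends on the chunk index. The function name simulate_seasonal_sketch, the amplitude, the period of 10 chunks, and the toy data below are all assumptions, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd


def simulate_seasonal_sketch(df, features, n_chunks=10, chunk_size=100,
                             amplitude=0.5, period=10, seed=42):
    """Hypothetical stand-in for simulate_P6_seasonal (not the notebook's code)."""
    # Resample rows with replacement to build n_chunks * chunk_size observations
    out = df.sample(n=n_chunks * chunk_size, replace=True,
                    random_state=seed).reset_index(drop=True)
    # Tag every row with its chunk index, mirroring the __chunk / __timestamp columns
    out["__chunk"] = np.repeat(np.arange(n_chunks), chunk_size)
    out["__timestamp"] = out["__chunk"]
    # Sinusoidal phase: zero at chunk 0, positive in early chunks, negative in later ones
    phase = np.sin(2 * np.pi * out["__chunk"] / period)
    for feat in features:
        # Shift each feature by a fraction of its own standard deviation
        out[feat] = out[feat] + amplitude * df[feat].std() * phase
    return out


# Toy data standing in for the monitoring baseline (assumed values)
toy = pd.DataFrame({"radius_mean": np.random.default_rng(0).normal(14.0, 3.5, 500)})
p6_sketch = simulate_seasonal_sketch(toy, ["radius_mean"])
chunk_means = p6_sketch.groupby("__chunk")["radius_mean"].mean()
print(chunk_means.round(2))
```

With an additive offset of this form, chunk 0 is left unchanged while later chunks move up and then down, which is one plausible way to obtain shifted values such as the slightly negative concavity_mean entries in the last chunk above.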

In [301]:
##################################
# Visualizing the feature means
# of the simulated seasonal pattern scenario
# against the baseline control
##################################
plot_feature_mean_line(p1, p6, SEASONAL_PATTERN_FEATURES, "Seasonal Pattern")
In [302]:
##################################
# Comparing feature distributions (boxplots)
# of the simulated seasonal pattern scenario
# against the baseline control
##################################
plot_feature_boxplot_comparison(p1, p6, SEASONAL_PATTERN_FEATURES, "Seasonal Pattern")
In [303]:
##################################
# Inspecting feature distributions by diagnosis label
# for the simulated seasonal pattern scenario
# and the baseline control
##################################
for feat in SEASONAL_PATTERN_FEATURES:
    fig, ax = plt.subplots(1, 2, figsize=(14, 3), sharey=True)
    combined_min = min(p1[feat].min(), p6[feat].min()) 
    combined_max = max(p1[feat].max(), p6[feat].max()) 
    y_margin = 0.05 * (combined_max - combined_min)
    y_min, y_max = combined_min - y_margin, combined_max + y_margin
    sns.boxplot(x="diagnosis", y=feat, data=p1, ax=ax[0], order=['M', 'B'])
    ax[0].set_title(f"{feat} by Label - Baseline Control")
    ax[0].set_ylim(y_min, y_max)
    sns.boxplot(x="diagnosis", y=feat, data=p6, ax=ax[1], order=['M', 'B'])
    ax[1].set_title(f"{feat} by Label - Seasonal Pattern")
    ax[1].set_ylim(y_min, y_max)
    plt.show()
    
In [304]:
##################################
# Detecting univariate drift for seasonal pattern
##################################
univariate_drift_analysis_p6 = detect_univariate_drift(p1, p6, FEATURE_COLUMNS, "Seasonal Pattern")
Univariate drift visualization generated for Seasonal Pattern
       chunk                                                                  \
       chunk                                                                   
         key chunk_index start_index end_index start_date end_date    period   
0     [0:99]           0           0        99       None     None  analysis   
1  [100:199]           1         100       199       None     None  analysis   
2  [200:299]           2         200       299       None     None  analysis   
3  [300:399]           3         300       399       None     None  analysis   
4  [400:499]           4         400       499       None     None  analysis   
5  [500:599]           5         500       599       None     None  analysis   
6  [600:699]           6         600       699       None     None  analysis   
7  [700:799]           7         700       799       None     None  analysis   
8  [800:899]           8         800       899       None     None  analysis   
9  [900:999]           9         900       999       None     None  analysis   

           area_mean                                  ...       texture_mean  \
  kolmogorov_smirnov                                  ... kolmogorov_smirnov   
               value upper_threshold lower_threshold  ...    lower_threshold   
0              0.040        0.150643            None  ...               None   
1              0.234        0.150643            None  ...               None   
2              0.314        0.150643            None  ...               None   
3              0.338        0.150643            None  ...               None   
4              0.243        0.150643            None  ...               None   
5              0.055        0.150643            None  ...               None   
6              0.252        0.150643            None  ...               None   
7              0.348        0.150643            None  ...               None   
8              0.291        0.150643            None  ...               None   
9              0.162        0.150643            None  ...               None   

                 texture_se                                         \
         kolmogorov_smirnov                                          
   alert              value upper_threshold lower_threshold  alert   
0  False              0.066         0.09887            None  False   
1   True              0.061         0.09887            None  False   
2   True              0.085         0.09887            None  False   
3   True              0.054         0.09887            None  False   
4   True              0.061         0.09887            None  False   
5  False              0.036         0.09887            None  False   
6   True              0.068         0.09887            None  False   
7   True              0.060         0.09887            None  False   
8   True              0.073         0.09887            None  False   
9   True              0.064         0.09887            None  False   

       texture_worst                                         
  kolmogorov_smirnov                                         
               value upper_threshold lower_threshold  alert  
0              0.097        0.140661            None  False  
1              0.074        0.140661            None  False  
2              0.053        0.140661            None  False  
3              0.076        0.140661            None  False  
4              0.064        0.140661            None  False  
5              0.068        0.140661            None  False  
6              0.125        0.140661            None  False  
7              0.060        0.140661            None  False  
8              0.090        0.140661            None  False  
9              0.091        0.140661            None  False  

[10 rows x 127 columns]
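
The per-chunk values above are Kolmogorov-Smirnov statistics compared against an upper threshold. As a hedged sketch of the idea (not the notebook's detect_univariate_drift helper and not NannyML's internal code), the statistic can be reproduced for a single feature with scipy.stats.ks_2samp. The mean-plus-three-standard-deviations threshold, the chunk size of 100, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def chunked_ks(reference, analysis, chunk_size=100):
    """Illustrative per-chunk KS drift check for one feature (assumed threshold rule)."""
    ref = np.asarray(reference, dtype=float)
    ana = np.asarray(analysis, dtype=float)

    def stats_for(values):
        # KS statistic of each chunk against the full reference sample
        return np.array([
            ks_2samp(ref, values[i:i + chunk_size]).statistic
            for i in range(0, len(values), chunk_size)
        ])

    # Threshold derived from how much the reference chunks vary among themselves
    ref_stats = stats_for(ref)
    upper_threshold = ref_stats.mean() + 3 * ref_stats.std()
    ana_stats = stats_for(ana)
    return ana_stats, upper_threshold, ana_stats > upper_threshold


# Synthetic feature with a sinusoidal shift per chunk (assumed, for illustration)
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 1000)
chunk_ids = np.repeat(np.arange(10), 100)
analysis = reference + np.sin(2 * np.pi * chunk_ids / 10)

ks_values, upper_threshold, alerts = chunked_ks(reference, analysis)
for idx, (value, alert) in enumerate(zip(ks_values, alerts)):
    print(f"chunk {idx}: ks={value:.3f} threshold={upper_threshold:.3f} alert={alert}")
```

In this sketch, chunks where the sinusoid crosses zero stay below the threshold while the others raise alerts, which resembles the pattern in the area_mean column above, where chunks 0 and 5 show the smallest statistics.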
In [305]:
##################################
# Visualizing univariate drift for seasonal pattern
##################################
univariate_drift_analysis_visualization_p6 = plot_univariate_drift_summary(univariate_drift_analysis_p6, FEATURE_COLUMNS, "Seasonal Pattern")
Univariate Drift Summary Table:
feature chunk_drift_count
0 radius_mean 7
1 texture_mean 8
2 perimeter_mean 8
3 area_mean 8
4 smoothness_mean 7
5 compactness_mean 8
6 concavity_mean 8
7 concave points_mean 8
8 symmetry_mean 7
9 fractal_dimension_mean 8
10 radius_se 0
11 texture_se 0
12 perimeter_se 0
13 area_se 0
14 smoothness_se 0
15 compactness_se 0
16 concavity_se 0
17 concave points_se 0
18 symmetry_se 0
19 fractal_dimension_se 0
20 radius_worst 0
21 texture_worst 0
22 perimeter_worst 0
23 area_worst 0
24 smoothness_worst 0
25 compactness_worst 0
26 concavity_worst 0
27 concave points_worst 0
28 symmetry_worst 0
29 fractal_dimension_worst 0
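
The chunk_drift_count column can be read as the number of analysis chunks whose alert flag is True for a feature. The sketch below rebuilds that count for two features using the KS values and thresholds printed earlier for area_mean and texture_se. The helper name count_drift_alerts and the (feature, method, field) column layout are assumptions modeled on the printed output, not the notebook's plot_univariate_drift_summary function.

```python
import pandas as pd

KS = "kolmogorov_smirnov"

# KS values copied from the univariate drift output above
area_values = [0.040, 0.234, 0.314, 0.338, 0.243, 0.055, 0.252, 0.348, 0.291, 0.162]
texture_se_values = [0.066, 0.061, 0.085, 0.054, 0.061, 0.036, 0.068, 0.060, 0.073, 0.064]

# Tuple keys produce three-level column labels: (feature, method, field)
toy = pd.DataFrame({
    ("area_mean", KS, "value"): area_values,
    ("texture_se", KS, "value"): texture_se_values,
})
# An alert is raised when the statistic exceeds the upper threshold
toy[("area_mean", KS, "alert")] = toy[("area_mean", KS, "value")] > 0.150643
toy[("texture_se", KS, "alert")] = toy[("texture_se", KS, "value")] > 0.09887


def count_drift_alerts(drift_df, features, method=KS):
    """Hypothetical helper: count alerting chunks per feature."""
    counts = {feat: int(drift_df[(feat, method, "alert")].sum()) for feat in features}
    return pd.DataFrame({"feature": list(counts),
                         "chunk_drift_count": list(counts.values())})


summary = count_drift_alerts(toy, ["area_mean", "texture_se"])
print(summary)
```

The resulting counts of 8 for area_mean and 0 for texture_se agree with the summary table above.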
In [306]:
##################################
# Estimating CBPE performance for seasonal pattern
##################################
chunk_cbpe_performance_analysis_p6 = estimate_chunk_cbpe_performance(p1, p6, boosted_cb_optimal, FEATURE_COLUMNS)
Chunk CBPE Performance Summary Table:
chunk roc_auc
key chunk_index start_index end_index start_date end_date period value sampling_error realized upper_confidence_boundary lower_confidence_boundary upper_threshold lower_threshold alert
0 [0:99] 0 0 99 None None reference 0.994438 0.00367 0.9924 1.000000 0.983427 1 0.976509 False
1 [100:199] 1 100 199 None None reference 0.997225 0.00367 0.9972 1.000000 0.986213 1 0.976509 False
2 [200:299] 2 200 299 None None reference 0.997486 0.00367 1.0000 1.000000 0.986475 1 0.976509 False
3 [300:399] 3 300 399 None None reference 0.995054 0.00367 0.9924 1.000000 0.984043 1 0.976509 False
4 [400:499] 4 400 499 None None reference 0.996828 0.00367 0.9928 1.000000 0.985817 1 0.976509 False
5 [500:599] 5 500 599 None None reference 0.995083 0.00367 0.9936 1.000000 0.984072 1 0.976509 False
6 [600:699] 6 600 699 None None reference 0.996123 0.00367 0.9960 1.000000 0.985111 1 0.976509 False
7 [700:799] 7 700 799 None None reference 0.995848 0.00367 0.9960 1.000000 0.984836 1 0.976509 False
8 [800:899] 8 800 899 None None reference 0.991864 0.00367 0.9780 1.000000 0.980853 1 0.976509 False
9 [900:999] 9 900 999 None None reference 0.994634 0.00367 0.9964 1.000000 0.983623 1 0.976509 False
10 [0:99] 0 0 99 None None analysis 0.994438 0.00367 NaN 1.000000 0.983427 1 0.976509 False
11 [100:199] 1 100 199 None None analysis 0.997635 0.00367 NaN 1.000000 0.986624 1 0.976509 False
12 [200:299] 2 200 299 None None analysis 0.996538 0.00367 NaN 1.000000 0.985527 1 0.976509 False
13 [300:399] 3 300 399 None None analysis 0.987125 0.00367 NaN 0.998136 0.976113 1 0.976509 False
14 [400:499] 4 400 499 None None analysis 0.996764 0.00367 NaN 1.000000 0.985753 1 0.976509 False
15 [500:599] 5 500 599 None None analysis 0.995083 0.00367 NaN 1.000000 0.984072 1 0.976509 False
16 [600:699] 6 600 699 None None analysis 0.994447 0.00367 NaN 1.000000 0.983435 1 0.976509 False
17 [700:799] 7 700 799 None None analysis 0.995009 0.00367 NaN 1.000000 0.983997 1 0.976509 False
18 [800:899] 8 800 899 None None analysis 0.991854 0.00367 NaN 1.000000 0.980842 1 0.976509 False
19 [900:999] 9 900 999 None None analysis 0.993473 0.00367 NaN 1.000000 0.982462 1 0.976509 False
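
The realized column is NaN for the analysis period because no labels are available there, so the ROC AUC values are estimates derived from the model's predicted probabilities. The following is a simplified sketch of that idea, assuming perfectly calibrated probabilities: each pair of observations is weighted by the probability that the first is positive and the second is negative. The function expected_roc_auc is hypothetical and is not NannyML's CBPE implementation, which differs in its details.

```python
import numpy as np


def expected_roc_auc(calibrated_proba):
    """Label-free ROC AUC estimate from calibrated probabilities (illustrative only)."""
    p = np.asarray(calibrated_proba, dtype=float)
    # weights[i, j]: probability that observation i is positive and j is negative
    weights = np.outer(p, 1.0 - p)
    # An observation cannot be paired with itself
    np.fill_diagonal(weights, 0.0)
    # A pair is concordant when the presumed positive has the higher score; ties count half
    diff = p[:, None] - p[None, :]
    concordant = (diff > 0) + 0.5 * (diff == 0)
    return float((weights * concordant).sum() / weights.sum())


# Confident, well-separated probabilities give an estimate close to 1
confident = np.array([0.02, 0.05, 0.03, 0.97, 0.95, 0.99])
# Probabilities near 0.5 carry little ranking information
uncertain = np.array([0.45, 0.50, 0.55, 0.50, 0.48, 0.52])

print(round(expected_roc_auc(confident), 4))
print(round(expected_roc_auc(uncertain), 4))
```

An estimate of this kind depends only on the predicted probabilities, which is why it can stay close to the reference level even when several input features drift, as long as the drift does not make the predictions less confident.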
In [307]:
##################################
# Visualizing CBPE performance for seasonal pattern
##################################
chunk_cbpe_performance_analysis_visualization_p6 = plot_chunk_cbpe_performance(chunk_cbpe_performance_analysis_p6, baseline_name="Baseline Control", scenario_name="Seasonal Pattern")
Chunk CBPE Performance Summary Table:
chunk_chunk_index chunk_period cbpe_roc_auc_alert_count
0 0 analysis 0
1 0 reference 0
2 1 analysis 0
3 1 reference 0
4 2 analysis 0
5 2 reference 0
6 3 analysis 0
7 3 reference 0
8 4 analysis 0
9 4 reference 0
10 5 analysis 0
11 5 reference 0
12 6 analysis 0
13 6 reference 0
14 7 analysis 0
15 7 reference 0
16 8 analysis 0
17 8 reference 0
18 9 analysis 0
19 9 reference 0
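
The zero alert counts can be checked against the CBPE table printed earlier. The numbers there are consistent with a confidence band of the estimate plus or minus three sampling errors, capped at 1, and with alerts being raised on the estimate itself rather than on the band. Both rules are inferred from the printed values and are assumptions of this sketch; the helper names are hypothetical.

```python
def confidence_band(value, sampling_error, multiplier=3.0, lower_cap=0.0, upper_cap=1.0):
    """Band of value +/- multiplier * sampling_error, clipped to the metric's range."""
    lower = max(lower_cap, value - multiplier * sampling_error)
    upper = min(upper_cap, value + multiplier * sampling_error)
    return lower, upper


def is_alert(value, lower_threshold, upper_threshold):
    """Alert only when the estimate itself leaves the threshold interval."""
    return value < lower_threshold or value > upper_threshold


# Analysis chunk 3 from the CBPE table (the lowest estimated ROC AUC)
value, sampling_error = 0.987125, 0.00367
lower, upper = confidence_band(value, sampling_error)
alert = is_alert(value, lower_threshold=0.976509, upper_threshold=1.0)

print(f"lower={lower:.6f} upper={upper:.6f} alert={alert}")
```

For chunk 3 the lower confidence boundary falls slightly below the lower threshold of 0.976509, yet the estimate of 0.987125 remains above it, so no alert is recorded. The small differences in the last decimal place relative to the table come from the rounded sampling error.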

1.10. Consolidated Findings

2. Summary

3. References

  • [Book] Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley and Todd Underwood
  • [Book] Designing Machine Learning Systems by Chip Huyen
  • [Book] Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson and Michael Munn
  • [Book] Machine Learning Engineering by Andriy Burkov
  • [Book] Engineering MLOps by Emmanuel Raj
  • [Book] Introducing MLOps by Mark Treveil, Nicolas Omont, Clément Stenac, Kenji Lefevre, Du Phan, Joachim Zentici, Adrien Lavoillotte, Makoto Miyazaki and Lynn Heidmann
  • [Book] Practical MLOps by Noah Gift and Alfredo Deza
  • [Book] Data Science on AWS by Chris Fregly and Antje Barth
  • [Book] Ensemble Methods for Machine Learning by Gautam Kunapuli
  • [Book] Applied Predictive Modeling by Max Kuhn and Kjell Johnson
  • [Book] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
  • [Book] Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou
  • [Book] Effective XGBoost: Optimizing, Tuning, Understanding, and Deploying Classification Models (Treading on Python) by Matt Harrison, Edward Krueger, Alex Rook, Ronald Legere and Bojan Tunguz
  • [Python Library API] NannyML by NannyML Team
  • [Python Library API] NumPy by NumPy Team
  • [Python Library API] pandas by Pandas Team
  • [Python Library API] seaborn by Seaborn Team
  • [Python Library API] matplotlib.pyplot by MatPlotLib Team
  • [Python Library API] itertools by Python Team
  • [Python Library API] sklearn.experimental by Scikit-Learn Team
  • [Python Library API] sklearn.preprocessing by Scikit-Learn Team
  • [Python Library API] scipy by SciPy Team
  • [Python Library API] sklearn.tree by Scikit-Learn Team
  • [Python Library API] sklearn.ensemble by Scikit-Learn Team
  • [Python Library API] sklearn.metrics by Scikit-Learn Team
  • [Python Library API] xgboost by XGBoost Team
  • [Python Library API] lightgbm by LightGBM Team
  • [Python Library API] catboost by CatBoost Team
  • [Python Library API] StatsModels by StatsModels Team
  • [Article] Comprehensive Comparison of ML Model Monitoring Tools: Evidently AI, Alibi Detect, NannyML, WhyLabs, and Fiddler AI by Tanish Kandivlikar (Medium)
  • [Article] Monitoring AI in Production: Introduction to NannyML by Adnan Karol (Medium)
  • [Article] Data Drift Explainability: Interpretable Shift Detection with NannyML by Marco Cerliani (Towards Data Science)
  • [Article] An End-to-End ML Model Monitoring Workflow with NannyML in Python by Bex Tuychiyev (DataCamp)
  • [Article] Detecting Concept Drift: Impact on Machine Learning Performance by Michal Oleszak (NannyML.Com)
  • [Article] Estimating Model Performance Without Labels by Jakub Białek (NannyML.Com)
  • [Article] Monitoring Workflow for Machine Learning Systems by Santiago Víquez (NannyML.Com)
  • [Article] Don’t Let Yourself Be Fooled by Data Drift by Santiago Víquez (NannyML.Com)
  • [Article] Understanding Data Drift: Impact on Machine Learning Model Performance by Jakub Białek (NannyML.Com)
  • [Article] NannyML’s Guide to Data Quality and Covariate Shift by Magdalena Kowalczuk (NannyML.Com)
  • [Article] From Reactive to Proactive: Shift your ML Monitoring Approach by Qiamo (Luca) Zheng (NannyML.Com)
  • [Article] How to Detect Under-Performing Segments in ML Models by Kavita Rana (NannyML.Com)
  • [Article] Building Custom Metrics for Predictive Maintenance by Kavita Rana (NannyML.Com)
  • [Article] 3 Custom Metrics for Your Forecasting Models by Kavita Rana (NannyML.Com)
  • [Article] There's Data Drift, But Does It Matter? by Santiago Víquez (NannyML.Com)
  • [Article] Monitoring Custom Metrics without Ground Truth by Kavita Rana (NannyML.Com)
  • [Article] Which Multivariate Drift Detection Method Is Right for You: Comparing DRE and DC by Miles Weberman (NannyML.Com)
  • [Article] Prevent Failure of Product Defect Detection Models: A Post-Deployment Guide by Kavita Rana (NannyML.Com)
  • [Article] Common Pitfalls in Monitoring Default Prediction Models and How to Fix Them by Miles Weberman (NannyML.Com)
  • [Article] Why Relying on Training Data for ML Monitoring Can Trick You by Kavita Rana (NannyML.Com)
  • [Article] Using Concept Drift as a Model Retraining Trigger by Taliya Weinstein (NannyML.Com)
  • [Article] Retraining is Not All You Need by Miles Weberman (NannyML.Com)
  • [Article] A Comprehensive Guide to Univariate Drift Detection Methods by Kavita Rana (NannyML.Com)
  • [Article] Stress-free Monitoring of Predictive Maintenance Models by Kavita Rana (NannyML.Com)
  • [Article] Effective ML Monitoring: A Hands-on Example by Miles Weberman (NannyML.Com)
  • [Article] Don’t Drift Away with Your Data: Monitoring Data Drift from Setup to Cloud by Taliya Weinstein (NannyML.Com)
  • [Article] Comparing Multivariate Drift Detection Algorithms on Real-World Data by Kavita Rana (NannyML.Com)
  • [Article] Detect Data Drift Using Domain Classifier in Python by Miles Weberman (NannyML.Com)
  • [Article] Guide: How to evaluate if NannyML is the right monitoring tool for you by Santiago Víquez (NannyML.Com)
  • [Article] How To Monitor ML models with NannyML SageMaker Algorithms by Wiljan Cools (NannyML.Com)
  • [Article] Tutorial: Monitoring Missing and Unseen values with NannyML by Santiago Víquez (NannyML.Com)
  • [Article] Monitoring Machine Learning Models: A Fundamental Practice for Data Scientists and Machine Learning Engineers by Saurav Pawar (Medium)
  • [Article] Failure Is Not an Option: How to Prevent Your ML Model From Degradation by Maciej Balawejder (Medium)
  • [Article] Managing Data Drift and Data Distribution Shifts in the MLOps Lifecycle for Machine Learning Models by Abhishek Reddy (Medium)
  • [Article] “You Can’t Predict the Errors of Your Model”… Or Can You? by Samuele Mazzanti (Medium)
  • [Article] Understanding Concept Drift: A Simple Guide by Vitor Cerqueira (Medium)
  • [Article] Detecting Covariate Shift: A Guide to the Multivariate Approach by Michał Oleszak (Medium)
  • [Article] Data Drift vs. Concept Drift: Differences and How to Detect and Address Them by DataHeroes Team (DataHeroes.AI)
  • [Article] An Introduction to Machine Learning Engineering for Production /MLOps — Concept and Data Drifts by Praatibh Surana (Medium)
  • [Article] Concept Drift and Model Decay in Machine Learning by Ashok Chilakapati (Medium)
  • [Article] Data Drift: Types of Data Drift by Numal Jayawardena (Medium)
  • [Article] Monitoring Machine Learning models by Jacques Verre (Medium)
  • [Article] Data drift: It Can Come At You From Anywhere by Tirthajyoti Sarkar (Medium)
  • [Article] Drift in Machine Learning by Piotr (Peter) Mardziel (Medium)
  • [Article] Understanding Dataset Shift by Matthew Stewart (Medium)
  • [Article] Calculating Data Drift in Machine Learning using Python by Vatsal (Medium)
  • [Article] 91% of ML Models Degrade in Time by Santiago Víquez (Medium)
  • [Article] Model Drift in Machine Learning by Kurtis Pykes (Medium)
  • [Article] Production Machine Learning Monitoring: Outliers, Drift, Explainers & Statistical Performance by Alejandro Saucedo (Medium)
  • [Article] How to Detect Model Drift in MLOps Monitoring by Amit Paka (Medium)
  • [Article] “My data drifted. What’s next?” How to handle ML model drift in production. by Elena Samuylova (Medium)
  • [Article] Machine Learning Model Drift by Sophia Yang (Medium)
  • [Article] Estimating the Performance of an ML Model in the Absence of Ground Truth by Eryk Lewinson (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Stacking Machine Learning: Everything You Need to Know by Ada Parker (MachineLearningPro.Org)
  • [Article] Ensemble Learning: Bagging, Boosting and Stacking by Edouard Duchesnay, Tommy Lofstedt and Feki Younes (Duchesnay.GitHub.IO)
  • [Article] Stack Machine Learning Models: Get Better Results by Casper Hansen (Developer.IBM.Com)
  • [Article] GradientBoosting vs AdaBoost vs XGBoost vs CatBoost vs LightGBM by Geeks for Geeks Team (GeeksForGeeks.Org)
  • [Article] A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] The Ultimate Guide to AdaBoost Algorithm | What is AdaBoost Algorithm? by Ashish Kumar (MyGreatLearning.Com)
  • [Article] A Gentle Introduction to Ensemble Learning Algorithms by Jason Brownlee (MachineLearningMastery.Com)
  • [Article] Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results by Necati Demir (Toptal.Com)
  • [Article] The Essential Guide to Ensemble Learning by Rohit Kundu (V7Labs.Com)
  • [Article] Develop an Intuition for How Ensemble Learning Works by Jason Brownlee (Machine Learning Mastery)
  • [Article] Mastering Ensemble Techniques in Machine Learning: Bagging, Boosting, Bayes Optimal Classifier, and Stacking by Rahul Jain (Medium)
  • [Article] Ensemble Learning: Bagging, Boosting, Stacking by Ayşe Kübra Kuyucu (Medium)
  • [Article] Ensemble: Boosting, Bagging, and Stacking Machine Learning by Aleyna Şenozan (Medium)
  • [Article] Boosting, Stacking, and Bagging for Ensemble Models for Time Series Analysis with Python by Kyle Jones (Medium)
  • [Article] Different types of Ensemble Techniques — Bagging, Boosting, Stacking, Voting, Blending by Abhishek Jain (Medium)
  • [Article] Understanding Ensemble Methods: Bagging, Boosting, and Stacking by Divya bhagat (Medium)
  • [Video Tutorial] Concept Drift Detection with NannyML | Webinar by NannyML (YouTube)
  • [Video Tutorial] Fooled by Data Drift: How to Monitor ML Without False Positives by NannyML (YouTube)
  • [Video Tutorial] Monitoring Custom Metrics Without Access to Targets by NannyML (YouTube)
  • [Video Tutorial] Analyzing Your Model's Performance in Production by NannyML (YouTube)
  • [Video Tutorial] How to Monitor Predictive Maintenance Models | Webinar Replay by NannyML (YouTube)
  • [Video Tutorial] Machine Learning Monitoring Workflow [Webinar] by NannyML (YouTube)
  • [Video Tutorial] Monitoring Machine Learning Models on AWS | Webinar by NannyML (YouTube)
  • [Video Tutorial] Root Cause Analysis for ML Model Failure by NannyML (YouTube)
  • [Video Tutorial] Quantifying the Impact of Data Drift on Machine Learning Model Performance | Webinar by NannyML (YouTube)
  • [Video Tutorial] How to Detect Drift and Resolve Issues in Your Machine Learning Models? by NannyML (YouTube)
  • [Video Tutorial] Notebooks to Containers: Setting up Continuous (ML) Model Monitoring in Production by NannyML (YouTube)
  • [Video Tutorial] Performance Estimation using NannyML | Tutorial in Jupyter Notebook by NannyML (YouTube)
  • [Video Tutorial] What Is NannyML? Introducing Our Open Source Python Library by NannyML (YouTube)
  • [Video Tutorial] How to Automatically Retrain Your Models with Concept Drift Detection? by NannyML (YouTube)
  • [Video Tutorial] How to Use NannyML? Two Modes of Running Our Library by NannyML (YouTube)
  • [Video Tutorial] How to Integrate NannyML in Production? | Tutorial by NannyML (YouTube)
  • [Video Tutorial] Bringing Your Machine Learning Model to Production | Overview by NannyML (YouTube)
  • [Video Tutorial] ML Performance without Labels: Comparing Performance Estimation Methods (Webinar Replay) by NannyML (YouTube)
  • [Course] DataCamp Python Data Analyst Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Associate Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Python Data Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Engineer Certificate by DataCamp Team (DataCamp)
  • [Course] DataCamp Machine Learning Scientist Certificate by DataCamp Team (DataCamp)
  • [Course] IBM Data Analyst Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Data Science Professional Certificate by IBM Team (Coursera)
  • [Course] IBM Machine Learning Professional Certificate by IBM Team (Coursera)